On the Predictability of Search Trends

Yair Shimshoni, Niv Efron, Yossi Matias
Google, Israel Labs
Draft date: August 17, 2009

1. Introduction

Since their launch, Google Trends and Google Insights for Search have provided a daily insight into what the world is searching for on Google, by showing the relative volume of search traffic in Google for any search query. An understanding of web search trends can be useful for advertisers, marketers, economists, scholars, and anyone else interested in knowing more about their world and what's currently top-of-mind.

The trends of some search queries are quite seasonal and have repeated patterns. For instance, the search trends for ski in the US and in Australia peak during the respective winter seasons, and the search trends for basketball correlate with annual league events and are consistent year over year. When looking at trends of the aggregated volume of search queries related to particular categories, one can also observe regular patterns in at least some of the hundreds of categories, like the Food & Drink or Automotive categories. Such trend sequences appear quite predictable, and one would naturally expect the patterns of previous years to repeat looking forward.

On the other hand, for many other search queries and categories, the trends are quite irregular and hard to predict. Examples include the search trends for Obama, Twitter, Android, or global warming, and the trends of aggregate searches in the News & Current Events category.

Having predictable trends for a search query or for a group of queries could have interesting ramifications. One could forecast the trends into the future, and use the forecast as a "best guess" for various business decisions such as budget planning, marketing campaigns and resource allocations. One could also identify deviations from such a forecast and thereby identify new factors that are influencing the search volume, as in the detection of influenza epidemics using search queries [Ginsberg et al. 2009], known as Flu Trends.

We were therefore interested in the following questions:
• How many search queries have trends that are predictable?
• Are some categories more predictable than others? How are predictable trends distributed among the various categories?
• How predictable are the trends of aggregated search queries for different categories? Which categories are more predictable and which are less so?

To learn about the predictability of search trends, and to overcome our basic limitation of not knowing what the future will entail, we characterize the predictability of a Trends series based on its historical performance. That is, we consider the a posteriori predictability of a sequence, determined by the discrepancy between a forecast made at some point in the past and the actual performance.

Specifically, we have used a simple forecasting model that learns basic seasonality and general trend. For each trends sequence of interest, we take a point in time, t, which is about a year back, compute a one-year forecast from t based on the historical data available at time t, and compare it to the actual trends sequence observed since time t. The discrepancy between the forecast trends and the actual trends characterizes the predictability level of a sequence, and when the discrepancy is smaller than a predefined threshold, we denote the trends query as predictable.

We investigate time series of search trends provided by Google Insights for Search (I4S), which represent query shares of given search terms (or of aggregations of terms). A query share is the total number of queries for a search term (or an entire search category) in a given geographic region divided by the total number of queries in that region at a given point in time. The query share represents the popularity of a query, or the aggregated search interest that users have in a query, and we will therefore use the term search interest interchangeably with query share.
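As a concrete illustration of the query share definition, the following minimal sketch divides per-period counts for a term by the total number of queries in the same region and period. The function name, the dictionary layout and the counts are illustrative assumptions; I4S does not expose raw counts, so this only mirrors the definition.

```python
def query_share(term_counts, total_counts):
    """Return {period: share}, where share = queries for the term / all queries."""
    shares = {}
    for period, total in total_counts.items():
        if total <= 0:
            raise ValueError(f"no queries recorded for period {period!r}")
        # A term with no recorded queries in a period has a share of zero.
        shares[period] = term_counts.get(period, 0) / total
    return shares


# Made-up monthly counts for one term in one region (illustrative, not real data).
term_counts = {"2009-01": 120, "2009-02": 90}
total_counts = {"2009-01": 10_000, "2009-02": 9_000}
shares = query_share(term_counts, total_counts)
print(shares)
```

Because the share is a ratio, a term's share can fall even when its raw query count grows, as long as total search volume in the region grows faster.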
The highlights of our observations can be summarized as follows:
• Over half of the most popular Google search queries were found predictable in a 12-month-ahead forecast, with a mean absolute prediction error of approximately 12% on average.
• Nearly half of the most popular queries are not predictable, with respect to the prediction model and evaluation framework that we have used.
• Some categories have a particularly high fraction of predictable queries; for instance, Health (74%), Food & Drink (67%) and Travel (65%).
• Some categories have a particularly low fraction of predictable queries; for instance, Entertainment (35%) and Social Networks & Online Communities (27%).
• The trends of aggregated queries per category are much more predictable: 88% of the aggregated category search trends of over 600 categories in Insights for Search are predictable, with a mean absolute prediction error of less than 6% on average.
• There is a clear association between the existence of seasonality patterns and higher predictability, as well as an association between high levels of outliers and lower predictability.

Recently the research community has started to use Google search data provided publicly by Google Insights for Search (I4S) as auxiliary indicators for economic forecasts. [Choi & Varian 2009] have shown that aggregated search trends of Google Categories can be used as extra indicators and effectively improve several US econometric prediction models. [Askitas & Zimmermann 2009] and [Suhoy 2009] have shown similar findings on German and Israeli economic data, respectively. Getting a better insight into the behavior of relevant search trends therefore has high potential applicability for these domains.

For queries or aggregated sets of queries for which the search trends are predictable, one can use forecasted trends based on the prediction model as a baseline for identifying deviations in actual trends.
Such deviations are of particular interest as they are often indicative of material changes in the domain of the queries. We consider a few examples with observed deviations of actual trends relative to the forecasted trends, including:
• Automotive Industry: We show that in the recent 12 months there is a positive deviation relative to the forecast baseline (i.e., an increased query share) in the searches of Auto Parts and Vehicle Maintenance, while there is a negative deviation (i.e., a decrease in query share) in the searches of Vehicle Shopping and Auto Financing.
• US Unemployment: In relation to the recent research that showed an improvement in the prediction of unemployment rates using Google query shares [Choi and Varian 2009 b], we show that the search interest in the category Welfare & Unemployment has risen substantially in the last year, above the forecast based on the prediction model. We also show that the search interest in the category Jobs has decreased significantly relative to the forecast of the prediction model.
• Mexico as Vacation Destination: We examine the large decrease in the query share of the category Mexico as a vacation destination, compared to the predictions for the last 12 months. We show that a similar deviation of (actual vs. forecast) query share is not observed for other related categories.
• Recession Markers: We show several examples that demonstrate possible influences of the recent recession on search behavior, like an observed increase of query share for the category Coupons & Rebate compared to the forecast. We also show a negative deviation of the query share for the category Restaurants from the forecast, whereas the category Cooking and Recipes shows a similar positive deviation.

Outline. The rest of this paper is organized as follows. In Section 2 we formulate the notion of predictability and describe the method of estimating it along with the evaluation measures, prediction model and the time series data that we use.
In Section 3 we describe the experiments we conducted and present their results. In Section 4 we examine the association between the predictability of search interest and the level of seasonality or internal deviation of the underlying search trends. Section 5 presents sensitivity analysis and error diagnostics, and in Section 6 we discuss the potential use of forecasting as a baseline for identifying deviations from regular search behavior; we demonstrate with some examples that the discrepancies from model predictions can act as signals for recent changes in the query share.

2. Time Series Predictability

In this section we define the notion of predictability as we use it in our experiments.

Predictability. We characterize the predictability of a time series with respect to a prediction model and a discrepancy measure, as follows. Assume we have:
• A time series X = {x_{t-H}, ..., x_{t+F}} with a history of size H and a future horizon of size F. Denote X_H = {x_{t-H}, ..., x_t} and X_F = {x_{t+1}, ..., x_{t+F}}
• A prediction model M, which computes a forecast Y = M(X_H), where Y = {y_{t+1}, ..., y_{t+F}}
• A discrepancy measure D = D(X, Y)
• A threshold D'
Then, we say that X is predictable w.r.t. (M, D, D', t, H, F) iff D = D(X, Y) < D'. The size of the discrepancy also characterizes the level of predictability of a series. We will often refer to a trends sequence as predictable (or not predictable) where the various parameters are implied by the context.

Data. The time series that are used in the following experiments are based on Google Insights for Search (I4S; see footnote 1), which reports the query share for search terms in any time, location and category, and can also report the most popular queries within a given time / location / category. A query share is defined as the total number of queries for a search term or a set of terms (e.g., an entire search category) in a given geographic region, divided by the total number of queries in that region, at a given point in time.
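The definition of predictability given earlier in this section can be sketched as a small predicate over a series, a model M, a discrepancy D and a threshold D'. This is only an illustration under stated assumptions: `seasonal_naive` is a placeholder model (not the model used in the paper), `mape` assumes errors are taken relative to the actual values, and the 0.25 threshold is borrowed from the MAPE condition listed later in this section.

```python
def mape(actual, forecast):
    """Mean absolute error relative to the actual values (assumed definition)."""
    return sum(abs(f - a) / abs(a) for a, f in zip(actual, forecast)) / len(actual)


def seasonal_naive(history, horizon, period=12):
    """Placeholder model M: repeat the last observed seasonal cycle."""
    last_cycle = history[-period:]
    return [last_cycle[h % period] for h in range(horizon)]


def is_predictable(series, horizon, model, discrepancy, threshold):
    """Split X into X_H and X_F, forecast Y = M(X_H), and test D(X_F, Y) < D'."""
    history, future = series[:-horizon], series[-horizon:]
    forecast = model(history, horizon)
    d = discrepancy(future, forecast)
    return d < threshold, d


# Synthetic monthly series: one repeats exactly, one doubles in the final year.
regular = [10 + (m % 12) for m in range(36)]
shifted = regular[:24] + [2 * v for v in regular[24:]]
print(is_predictable(regular, 12, seasonal_naive, mape, 0.25))  # (True, 0.0)
print(is_predictable(shifted, 12, seasonal_naive, mape, 0.25))  # (False, 0.5)
```

Keeping the model and the discrepancy as parameters reflects the point of the definition: a series is predictable only relative to a chosen (M, D, D'), not in any absolute sense.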
The I4S categories are organized in a tree-like hierarchical structure, with about 30 root level categories that are further divided into subcategories in a 3-level taxonomy, to a total of about 600 categories and sub-categories. Each search query is classified by I4S into a single category; nevertheless, it is also counted as part of the query share of all its 'parent' categories. For each category, I4S calculates an aggregated time series which represents the overall query share of this category (i.e., the combined search interest of all the queries in the category).

In order to stay focused on the most influential patterns of the yearly seasonality and overall trend (direction), we are using time series of monthly granularity (i.e., one data point per calendar month) and refer to the entire available period (2004-2009). Search trends with finer granularity (e.g., weekly or daily search data) do capture more patterns of search behavior, such as intra-monthly patterns and especially the day-of-week effect; however, the fine resolution data is also noisier and thus calls for prediction models with higher complexity and a less homogeneous model space. We leave that for future research.

We have extracted time series of the entire available time range (2004-2009; see footnote 2), each consisting of 67 data points, which were partitioned into two parts:
1. The History Period - 55 monthly data points (January 2004 - July 2008)
2. The Forecast Period - 12 monthly data points (August 2008 - July 2009)

Throughout the work, we will refer to 3 data sets of time series (with a similar format):
1. Country Data - Includes time series of the query shares for the 10,000 most popular queries in each of these countries: USA, UK, Germany, France and Brazil.
2. Category Data - Includes time series of the query shares for the 1,000 most popular queries in the US, for 10 major I4S categories: Automotive, Entertainment, Finance & Insurance, Food & Drink, Health, Social Networks & Online Communities, Real Estate, Shopping, Telecommunications, and Travel.
3. Aggregated Categories Data - Includes time series of aggregated query shares for about 600 I4S categories, which represent the normalized combined search volume in the US for each respective category.

Generic Prediction Model. Our prediction process is based on the STL procedure [Cleveland et al. 1990], which is a filtering procedure based on locally weighted least squares for decomposing a given time series X into Trend, Seasonal and Residual components. STL is basically an EM-like algorithm that iteratively calculates the seasonal part assuming knowledge of the trend part. To compute the forecast of the future values, we extrapolate the trend sub-series using regression, and use the last seasonal period of the seasonal component.

Footnote 1: URL: http://www.google.com/insights/search/#
Footnote 2: The time series data was pulled during July 2009, thus the value for this last month might change.

The STL procedure uses 6 configuration parameters, 3 of which are smoothing parameters for the three components, which in general should be chosen per time series. The prediction process in our experiments used a fixed STL configuration for the forecast of all the time series. Given a sampled archive of search time series, we used an exhaustive exploration and evaluation process that searched for the best parameter set among a predefined set of candidate parameter values. The optimality criterion was minimal mean absolute error and the output was a single parameter set w.r.t. the given sampled archive.
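The forecasting step described above (extrapolate the trend by regression, then add back the seasonal pattern) can be illustrated with a deliberately simplified stand-in. The sketch below is not STL: it replaces the locally weighted smoothing with a global straight-line trend and a fixed seasonal profile, and alternates between the two estimates for a few rounds. All function names and defaults are assumptions made for illustration.

```python
def fit_line(values):
    """Ordinary least squares of values against their index; returns (intercept, slope)."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in range(n))
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    slope = sxy / sxx
    return mean_v - slope * mean_t, slope


def decompose(history, period=12, iterations=10):
    """Alternate between a linear trend fit and per-position seasonal averages.

    Requires len(history) >= period so every cycle position has data.
    """
    seasonal = [0.0] * period
    intercept = slope = 0.0
    for _ in range(iterations):
        # Trend step: regress the seasonally adjusted series on time.
        adjusted = [v - seasonal[t % period] for t, v in enumerate(history)]
        intercept, slope = fit_line(adjusted)
        # Seasonal step: average the detrended values at each cycle position.
        detrended = [v - (intercept + slope * t) for t, v in enumerate(history)]
        seasonal = [
            sum(detrended[p::period]) / len(detrended[p::period])
            for p in range(period)
        ]
    return intercept, slope, seasonal


def forecast(history, horizon, period=12, iterations=10):
    """Extrapolate the fitted line and add the seasonal profile for each future month."""
    intercept, slope, seasonal = decompose(history, period, iterations)
    n = len(history)
    return [
        intercept + slope * (n + h) + seasonal[(n + h) % period]
        for h in range(horizon)
    ]


# Synthetic series: linear growth plus a fixed 12-month pattern (not real data).
pattern = [5, 3, 0, -2, -4, -5, -3, 0, 2, 4, 3, -3]
series = [100 + 0.5 * t + pattern[t % 12] for t in range(60)]
predicted = forecast(series[:48], 12)
print(max(abs(p - a) for p, a in zip(predicted, series[48:])))
```

Because the seasonal profile here is constant across years, "the last seasonal period" coincides with the averaged profile; in a real STL decomposition the seasonal component can drift, which is why the paper takes its last period.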
By choosing to use a particular (fixed) configuration, rather than adjusting an individual parameter set for each given time series, we fit the configuration to a large set of time series, which simplifies the prediction model and enables much faster forecasting.

Prediction Discrepancy Function. We define the discrepancy D as a combination of several error metrics between the forecasted trends Y and the actual trends X_F, as well as seasonal consistency metrics determined by the difference in the autocorrelation between X_H and X. Specifically, D is defined as a tuple:

D = < MAPE, MaxAPE, NMSSE, MeanAbsACFDiff, MaxAbsACFDiff >

and D is considered to be below the threshold D' when every component is below its respective threshold. Thus, we say that a given time series is predictable within the available time frame, w.r.t. the prediction model we use and the above error and consistency metrics, if all the following conditions are fulfilled:
1) The Mean Absolute Prediction Error (MAPE) < 25%
2) The Max Absolute Prediction Error (MaxAPE) < 100%
3) The Normalized Mean Sum of Squared Errors (NMSSE) < 10.0
4) The Mean Absolute Difference of the ACF Coef. Sets (MeanAbsACFDiff) < 0.2
5) The Max Absolute Difference of the ACF Coef. Sets (MaxAbsACFDiff) < 0.4

Predictability Ratio. Given a set A of time series, denote its predictability ratio as the number of predictable time series in A, divided by the total number of time series in A.

3. Experiments and Results

Comparing the Predictability of Top Queries in Different Countries. We have conducted an experiment to test the predictability of search trends with regard to the 10,000 most popular search queries in five countries:

Country    Predictability   Avg. MAPE (%)            Avg. MaxAPE (%)
           Ratio (%)        (predictable queries)    (predictable queries)
USA        54.1             11.8                     27.1
UK         51.4             12.7                     32.1
Germany    56.1             11.8                     28.2
France     46.9             12.8                     28.8
Brazil     46.3             13.7                     30.5

Although the above results show some variability among the different countries, one can see that in general, about half of the time series that correspond to popular queries in Google Web Search are predictable with respect to the given prediction model and discrepancy function / threshold. One can see that among the predictable queries, the mean absolute prediction error (MAPE) is about 12% on average, while the maximum absolute prediction error (MaxAPE) is about 30% on average.

The Seasonality of Time Series. Time series in general often include various forms of regularity, like a consistent trend (straight, upward or downward) or seasonal patterns (daily, weekly, monthly, etc.). In seasonal time series, the amplitude changes over time in a regular recurring fashion according to the relevant season. In many practical cases, it is common to use a seasonality adjustment in which the seasonal component is subtracted from the time series before the analysis, and there are procedures that decompose time series into their seasonal and trend components [Cleveland and Tiao 1976], [Lytras et al. 2007]. We use such a decomposition to compute a metric that represents the relative portion of seasonality within a time series as follows:

Given a time series X = {x_1, ..., x_T} and a decomposition of X into a seasonal component S and a trend (i.e., directional) component Tr, then:

Seasonality Ratio(X) = ( ∑ |S_i| ) / ( ∑ |Tr_i| )

For each time series we forecast, we compute the respective seasonality ratio. For example, let us examine the time series which represents the search interest for the query Cheesecake (in the US, 2004-2009). The blue curve in the following plot shows the original time series, which has a significant seasonal component.
The red curve is the seasonality adjusted time series; i.e., the trend component that is left after subtracting the seasonality component. It has an upward trend (slope: 0.18) plus some variability. The seasonality ratio is 2.64 (which is at the 96th percentile of the 10,000 tested queries) and approximates the ratio of the area between the red and blue curves to the area underneath the red curve.

The Deviation of a Time Series. In order to assess the extent to which a time series contains extreme values or outliers with large deviation from the overall pattern, we calculate for each time series the deviation ratio. In general, we compute the sum of the top values in the series divided by the total sum of the series, assuming that a large ratio would indicate the existence of considerable extreme values in the series. We normalize by the relative number of top values under consideration.

Given a time series X = {x_1, ..., x_T} and an integer w between 1 and 100, let Prc(X,w) denote the w-th percentile of X. Then:

Deviation Ratio(X) = ( ∑ {x_i : x_i >= Prc(X,w)} ) / ( (1-(w/100)) ∑ x_i )

We use w=90. Notice that the normalization term, (1-(w/100)) in the denominator, sets the minimal ratio to 1.

Due to the relatively short time series (of 67 points) and since many cases show seasonal patterns with high and narrow peaks (e.g., as in the plot above), it is possible that these sharp peaks will be considered as outliers, although they are a regular part of the time series' recurring dynamics. To mitigate this, we computed the deviation ratio on the seasonally adjusted time series (i.e., on the trend component that is left after the seasonal component is subtracted by the decomposition we have described above).

The Predictability of Search Categories. In order to assess the predictability of categories, we have extracted the 1,000 most popular queries in the US for a selection of 10 root level categories and tested their predictability.
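The two regularity measures defined above, the seasonality ratio and the deviation ratio, can be sketched as follows. The sketch assumes the reading of the deviation ratio in which the numerator sums the values at or above the w-th percentile, and it assumes a percentile computed by linear interpolation between order statistics; neither convention is spelled out in the text, and the function names are illustrative.

```python
def seasonality_ratio(seasonal, trend):
    """Sum of |S_i| divided by sum of |Tr_i| for a given decomposition."""
    return sum(abs(s) for s in seasonal) / sum(abs(tr) for tr in trend)


def percentile(values, w):
    """w-th percentile with linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    pos = (w / 100) * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])


def deviation_ratio(values, w=90):
    """Share of the total carried by the top values, normalized by (1 - w/100)."""
    threshold = percentile(values, w)
    top_sum = sum(v for v in values if v >= threshold)
    return top_sum / ((1 - w / 100) * sum(values))


# Synthetic examples (not real data): an even series and one with a single spike.
even = list(range(1, 11))
spiky = [1] * 9 + [11]
print(round(deviation_ratio(even), 2))   # 1.82
print(round(deviation_ratio(spiky), 2))  # 5.5
print(seasonality_ratio([1, -1, 2, -2], [2, 2, 2, 2]))  # 0.75
```

Note that with many tied values the `>=` comparison can select more than the top (100 - w)% of points, which inflates the ratio; following the text, the input would be the seasonally adjusted (trend) series rather than the raw one.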
In the following table, we present the summary results, where the Predictability column on the left refers to the entire 1,000 queries (per category), and the two error metrics (MAPE and MaxAPE in columns 3 & 4) refer only to the subset of Predictable queries within each category. The seasonality and deviation ratios also refer to the entire category sets. The Predictability per category spans from 74% for the Health category to 27% for the Social Networks & Online Communities category. In the third column we can see that the Mean Absolute Prediction Error (MAPE) varies from 9% (in the Health category) to 14.1% (in the Social Networks & Online Communities category). The average MAPE for the 10 categories is 12.35%. Notice that the order of the predictability ratio is not equal to the order of the MAPE error, since the Predictability is based on several other metrics as described in Section 2; however, the correlation between them is high (r= -0.85). The variability within the columns of seasonality ratio and deviation ratio represents the differences between the search profiles of the various categories, which correspond to the variability of the categories' predictability ratio. For example, notice the relatively high seasonality ratio and low deviation ratio of the Food & Drink category, which has a 66.7% predictability ratio, vs. the opposite situation of the Entertainment category, which has a 35.4% predictability ratio with a relatively low seasonality ratio and high deviation ratio.
Category Name          Predictability  MAPE (%)     MaxAPE (%)   Seasonality  Deviation
                       Ratio (%)       predictable  predictable  Ratio        Ratio
                                       queries      queries
Health                 74.00           9.00         20.00        0.73         1.58
Food & Drink           66.70           11.90        26.00        1.20         1.74
Travel                 64.70           11.80        27.00        1.09         1.61
Shopping               63.30           12.40        28.00        1.21         1.78
Automotive             57.60           11.20        24.90        0.71         1.84
Finance & Insurance    52.90           13.30        30.60        0.65         2.00
Real Estate            49.50           12.90        29.90        0.72         1.82
Telecommunications     45.60           12.90        29.40        0.32         2.34
Entertainment          35.40           14.00        32.30        0.46         2.49
Social Networks        27.50           14.10        30.10        0.19         2.95

For the above summary results of the 10 categories, the correlation between the Predictability and the Seasonality Ratio is r= 0.80, while the Deviation Ratio has a (negative) correlation of r= -0.94 with the Predictability. In the next section we will further examine the association between these regularity characteristics and the predictability.

The Predictability of Aggregated Time Series that Represent Categories. We now show the results of an experiment of forecasting aggregated time series that represent the overall query share of categories (i.e., the combined search interest of all the queries in the category). We ran the experiment on the aggregated time series of over 600 I4S categories and computed the average absolute prediction error over a period of 12 months ahead. We found 88% of the aggregated category time series to be predictable. The average MAPE for the entire set of aggregated category time series is 8.15% (6.7% for the Predictable time series only), with STD=4.18%. The average Maximum Prediction Error (MaxAPE) for the entire set was 19.2% (16.6% for the Predictable time series only).

In the table below, we show the prediction errors for the aggregated time series for the same 10 root categories we examined above. Notice that the prediction errors are now smaller, which was expected. However, we can also see that the order of the categories is not the same as the respective order in the table of the previous experiment.
In general, the aggregated time series should have a higher predictability due to the noise reduction effect of the aggregation. The rightmost column shows the MAPE Reduction Rate, which is the relative improvement from the average prediction error (MAPE) of the 1,000 queries per category (in the previous experiment) to the single MAPE of the aggregated category time series here. All categories (except Social Networks & Online Communities) had their MAPE reduced, from a 47% improvement for the Finance & Insurance category up to 85% for the Food & Drink category.

Category               MAPE (%)  MaxAPE (%)  Seasonality  Deviation  MAPE Reduction
                                             Ratio        Ratio      Rate
Food & Drink           1.76      4.52        0.70         1.18       0.85
Shopping               2.72      6.02        2.77         1.11       0.78
Entertainment          2.74      5.95        0.30         1.16       0.80
Health                 2.99      7.69        1.04         1.11       0.67
Automotive             3.27      7.36        1.69         1.12       0.71
Travel                 3.94      7.61        1.92         1.12       0.67
Telecommunications     5.2       9.07        0.74         1.20       0.60
Real Estate            5.62      12.8        2.95         1.11       0.56
Finance & Insurance    7.08      17.8        0.61         1.26       0.47
Social Networks        38.6      50.4        0.06         2.46       -1.74

The I4S classification into search categories is based on a hierarchical tree-like taxonomy where each category at the root level of the tree has several sub-categories under it. Thus, an overall evaluation of the prediction error across all categories can be based on the average of the MAPE values of the 27 root level categories. However, a 'regular' (uniform) average, which gives the same weight to each category, might be inaccurate. Therefore, we have computed a weighted average of the root categories' MAPE, where the weights are the overall relative search interest of each root category. The weighted average MAPE is 4.25%.
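The MAPE Reduction Rate and the weighted average described above can be sketched in a few lines. The reduction formula, 1 - (aggregated MAPE / average query-level MAPE), is inferred from the two tables rather than stated in the text, and the weights in the usage example are made up, since the per-category search interest weights are not listed.

```python
def mape_reduction_rate(query_level_mape, aggregated_mape):
    """Relative improvement when moving from single queries to the aggregate."""
    return 1 - aggregated_mape / query_level_mape


def weighted_mean(values, weights):
    """Average of values weighted by, e.g., each category's relative search interest."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


# MAPE values taken from the two category tables (query-level, aggregated).
print(round(mape_reduction_rate(11.9, 1.76), 2))  # Food & Drink: 0.85
print(round(mape_reduction_rate(14.1, 38.6), 2))  # Social Networks: -1.74

# Hypothetical weights (not from the paper) to show the effect of weighting.
print(round(weighted_mean([1.76, 38.6], [0.9, 0.1]), 2))  # 5.44
```

A negative reduction rate, as for Social Networks, means the aggregated series was harder to forecast than the average single query in that category.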
The following table shows the prediction errors for the I4S root categories (sorted by the MAPE):

Root Category              MAPE (%)  MaxAPE (%)
Food & Drink               1.76      4.52
Beauty & Personal Care     2.2       7.41
Home & Garden              2.21      4.9
Photo & Video              2.34      8.31
Lifestyles                 2.38      5.27
Games                      2.59      4.45
Shopping                   2.72      6.02
Entertainment              2.74      5.95
Business                   2.91      11.5
Health                     2.99      7.69
Local                      3.24      5.49
Automotive                 3.27      7.36
Reference                  3.7       8.2
Industries                 3.77      7.14
Recreation                 3.81      7.58
Computers & Electronics    3.93      7.83
Travel                     3.94      7.61
Internet                   4.87      15
Telecommunications         5.2       9.07
Society                    5.57      12.6
Real Estate                5.62      12.8
Sports                     5.81      29.3
Arts & Humanities          6.98      11.8
Finance & Insurance        7.08      17.8
Science                    10.1      15.5
News & Current Events      16.6      47
Social Networks            38.6      50.4
Average                    5.81      12.5

Comparing the Predictability of a Category and its Sub-Categories. It is reasonable to expect that a time series of the aggregated search of a set of queries should in general be more predictable than single queries. The larger the aggregation set is, the smaller the variability of the aggregated time series would be. This has implications for the predictability of categories vs. sub-categories, but also for aggregated time series of groups of queries, such as campaign related queries or brand/topic related queries in general.

In order to demonstrate this, we have explored the MAPE and MaxAPE prediction errors of the I4S category Vehicle Brands (in the Automotive category), compared to all its 31 'children' sub-categories. The variability of the prediction errors (MAPE) within the 31 vehicle brands sub-categories is substantial, ranging from 3% to 38%. The average MAPE of the 31 brands is 11.4% (with STD=7.7%), which is quite similar to the average MAPE for the 1,000 most popular queries in the Automotive category (11.2%) as we presented above. As expected, the average MAPE of the 31 sub-categories is larger than the MAPE of the aggregated time series of the Vehicle Brands category, which is only 3.39%.
We have also calculated the median MAPE (9.3%), as well as the weighted average MAPE (9.7%, with the relative search interest per category as weights). Both the median and the weighted average are lower than the regular average but still much larger than the MAPE for the overall aggregated category of Vehicle Brands.

4. Predictability vs. Seasonality and Deviation Ratios

Among the 10 categories for which we have analyzed the 1,000 most popular queries, we calculated a correlation of r= 0.80 between the Predictability and the Seasonality Ratio and r= -0.94 between the Predictability and the Deviation Ratio (see table in Section 3). Below, we examine the association between these two time series characteristics and the MAPE prediction error in the experiment we conducted on the 10,000 most popular queries in the US.

Seasonality and Prediction Errors. Many patterns of search behavior have a strong seasonal component (e.g., holiday shopping, summer vacation, etc.) as implied by the specific market they are in. Occasionally, there is also a directional trend effect (up, down or changing) which might be less visually pronounced due to the confounding seasonal pattern. We have used the Seasonality Ratio (described above) as a representation of the 'level of seasonality' of the queries.

Among the 10,000 most popular queries in the US, the Seasonality Ratio varies in the rather large range [0.01,13], from time series with no seasonal component up to extremely seasonal time series. The median Seasonality Ratio is 0.4 and its mean value is 0.8. We could see no significant correlation between the prediction error and the seasonality ratio. In order to visualize this possible association, we have sorted the values of the seasonality ratio and created a 'smoothed' array of 10 average points (see footnote 3). Similarly, we have computed a 'smoothed' array of averages for the 10,000 corresponding MAPE prediction errors, which were sorted according to the corresponding seasonality ratio.
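The smoothing procedure described above (sort by the seasonality ratio, then average consecutive non-overlapping windows) and the correlation computed on the resulting arrays can be sketched as follows. The function names are illustrative, the Pearson coefficient is assumed to be the correlation measure used, and the sample data is synthetic.

```python
import math


def window_means(values, k):
    """Averages of k consecutive non-overlapping windows of size len(values) // k.

    Any trailing elements that do not fill a complete window are ignored.
    """
    m = len(values) // k
    return [sum(values[i * m:(i + 1) * m]) / m for i in range(k)]


def smooth_by(sort_key, values, k=10):
    """Sort both arrays by sort_key, then reduce each to k window averages."""
    order = sorted(range(len(sort_key)), key=lambda i: sort_key[i])
    sorted_key = [sort_key[i] for i in order]
    sorted_values = [values[i] for i in order]
    return window_means(sorted_key, k), window_means(sorted_values, k)


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic example (not real data): 20 unsorted ratios and matching errors.
ratios = [19 - i for i in range(20)]
errors = [2 * r for r in ratios]
smooth_ratios, smooth_errors = smooth_by(ratios, errors, k=4)
print(smooth_ratios, smooth_errors, pearson(smooth_ratios, smooth_errors))
```

Averaging within windows removes much of the per-query noise, which is why a correlation computed on the smoothed arrays can be far stronger than one computed on the raw values; the two figures answer different questions and should not be compared directly.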
We show here a scatter plot of the 'smoothed' MAPE vs. the 'smoothed' seasonality ratio. The plot shows a non-stable 'negative' association between prediction errors and the seasonality. The correlation coefficient between the 'smoothed' arrays is substantial (r=0.55), compared to the insignificant correlation we saw for the entire set.

Footnote 3: Given a time series {Y_N}, N=10,000; K=10; M=N/K=1,000. We compute an array A = {A_1, A_2, ..., A_K} of the averages of K consecutive non-overlapping windows of size M over the time series {Y_N}, such that A_k = (1/M) ∑ Y_i, where k = 1, ..., K and i = 1+(k-1)M, ..., kM.

For the next plot we have repeated the same process, but for predictable time series only. The result shows a stronger 'negative' association between the MAPE prediction error and the seasonality ratio for Predictable queries.

Deviation Ratio and Prediction Errors. The Deviation Ratio, which represents the level of outliers and irregular extreme values in a time series, was found to be associated with the Predictability of the search interest time series. For the 10,000 queries we tested, the average deviation ratio was 2.08 (STD=1.9). Only 5% of the Predictable time series had a deviation ratio in the upper quartile, and 73% of the Predictable time series had a deviation ratio under the median. The correlation coefficient between the deviation ratio and the MAPE error was r=0.29. The average deviation ratio for the Predictable time series was 1.50, whereas for the non-Predictable queries the average was 2.77.

We have applied the same process as above in order to visually demonstrate the association between MAPE and the deviation ratio. The following plot shows a clear positive association between the (sorted) 'smoothed' array of the deviation ratio and the corresponding 'smoothed' array of prediction errors (MAPE). The correlation coefficient calculated for the 'smoothed' arrays was r=0.88 (compared to r=0.29, which was computed with the original values).
Hence, we can say that the larger the deviation level in the time series, the larger the prediction error. This can also be seen in the next plot for the Predictable queries only.

5. Sensitivity Analysis and Errors Diagnostics

Sensitivity of the Predictability Thresholds. As described earlier, we have chosen a predefined set of thresholds which correspond to the three prediction error metrics (MAPE, MaxAPE, NMSSE) and two consistency metrics. These thresholds are responsible for the trade-off between the Predictability Ratio and the distribution of errors within the Predictable time series. In the following figure we see a sensitivity plot for the Mean Absolute Prediction Error (MAPE), which shows how the Predictability Ratio behaves as a function of the Predictability Threshold. We present a separate analysis for each error measure, and not the conjunction of all the conditions that appears in our Predictability definition.

The following plot shows that choosing a Predictability Threshold [MAPE<0.25] 'qualifies' more than 60% of the queries (for a single metric condition). Raising the MAPE threshold by 100%, to 0.5, would imply that the Predictability Ratio would rise by ~30% (using only the MAPE error metric). Raising the MAPE threshold even more, by 200%, to 0.75, would imply that the Predictability Ratio would rise by ~50% and would qualify approximately 90% of the queries.

The next plots are the sensitivity plots for the MaxAPE and NMSSE error metrics. We can see that both chosen Predictability thresholds (1.0, 10.0) are located much farther into the "Predictable Region" and qualify almost 90% of the queries. Thus, in our experiments we use the MAPE as our primary 'filter', while the MaxAPE and the NMSSE play a secondary role. The following plot displays a similar presentation by showing the number of Predictable time series as a function of the Predictability Threshold (using only the MAPE error measure).

Prediction Errors Diagnostics.
In this section we show diagnostic plots for the US data (top 10,000 queries). The following figure shows the actual values vs. the predicted values (in log scale) for each of the 12 months in the Forecast period. The top 12 diagrams refer to the entire set of queries, followed by 12 diagrams for the Predictable queries only. One can clearly see the better prediction performance for the Predictable queries (at the bottom part), as expected. Notice that the performance for the different months deteriorates with time (higher average and STD of the prediction errors), especially towards the later months.

In order to learn more about the distribution of the average and maximum prediction errors within the top 10,000 most popular queries in the US, we present the histograms of the MAPE and MaxAPE error measures, with the density estimation superimposed (in red). We can see that both distributions are positively skewed and that the value of the average error is largely affected by the extreme error values. Notice that we have trimmed the data at 0.75 and 3.0 for MAPE and MaxAPE respectively (i.e., 3 x the chosen thresholds), to stay focused on the major part of the distribution.

Comparison of the Forecast Performance along the Future Horizon. Since in our experiments we are simultaneously predicting 12 months ahead, it is expected that the forecasts for the later months may have larger prediction errors. We have compared the prediction performance for the 12 consecutive months in the Forecast period. The following plot shows the distribution of MAPE prediction errors for each future month. We are showing the average monthly MAPE for the Predictable queries only (among the 10,000 most popular in the US). Notice that the first month is predicted with greater accuracy than the rest; then there is an approximately constant error level for months 2-9, with some increase of the error rate in the last 3 months of the Forecast period.
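A per-month comparison of this kind can be sketched as follows. The sketch assumes that the absolute error is taken relative to the actual value and averaged over queries, and that a signed variant uses predicted minus actual, so that positive values mean over-prediction; neither the function name nor these conventions are taken from the study's implementation.

```python
import numpy as np


def monthly_errors(actual, predicted):
    """Per-horizon error summaries for arrays of shape (n_queries, n_months).

    Returns (mape, mean_error), each of length n_months.
    Assumption: errors are relative to the actual value, and the signed
    error is (predicted - actual) / actual, so positive = over-prediction.
    """
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    rel = (predicted - actual) / actual
    mape = np.abs(rel).mean(axis=0)    # average over queries, per month
    mean_error = rel.mean(axis=0)      # keeps the sign, so errors can cancel
    return mape, mean_error
```

Because the signed average lets over- and under-predictions cancel, a month can have a sizeable MAPE and a mean error close to zero; a mean error that stays on one side of zero across months indicates a systematic bias.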
The following plot shows the same type of diagram, but for the Mean Prediction Error (i.e., the 'directional' error measure with the sign). We can learn from this plot that there was a positive (upward) bias in the predictions for all months except the 11th. Such a systematic tendency of the errors can be explained by a reduction of query share for many queries in the Forecast period (Aug 2008 - July 2009) due to the global economic crisis. Hence the actual search interest values were lower than expected by the prediction model, which was based on the previous years. In the following section we present examples of categories (and queries) regarding various markets and brands, for which the actual monthly query shares for the recent 12 months differ from the model prediction.

6. Search Interest Forecasting as baseline for identifying deviations

The aggregated query shares of the Google Insights for Search (I4S) categories were used in a recent work by Choi and Varian (2009), which showed how data taken from Google I4S could help to predict economic time series. For example, in the analysis of the US Retail Trade they used the weekly aggregated time series of categories like Automotive, Computers & Electronics, Apparel, Sporting Goods, Mass Shopping, Merchants & Department Stores, etc. In a later work, [Choi and Varian 2009b] applied the same methodology to the U.S. unemployment time series using two sub-categories, Jobs and Welfare & Unemployment. They did not attempt to forecast the Google query share; rather, they successfully used it as a predictor for external economic time series. Other works have shown similar results regarding the capability of aggregated categories' query share to predict econometrics and unemployment data from Germany [Askitas and Zimmerman 2009] as well as from Israel [Suhoy 2009].
In the following, we will show time series of the monthly query share of categories, where the forecast values (in red) are superimposed on the actual values (in blue). The errors made by the prediction model express the deviation between the expected and the actual search behavior, which conveys valuable information regarding the current state of search interest in the respective categories. Choi and Varian have shown that the users' search interest in several categories, as represented by the aggregated query shares, indeed has short-term predictive power regarding the actual underlying. The following plots show the aggregated time series of various categories that relate to some major US markets. These category plots, which are ordered by their average MAPE, vary in their Predictability level. From the 10 category plots, we can see that many present a clear seasonal pattern. The first 7 time series showed a relatively low error rate (MAPE<6%), which is in accordance with the substantial regularity of search behavior in the respective categories that was maintained throughout the Forecast period. However, notice that the category of Finance & Insurance, which shows a seasonal pattern with some medium irregularities (the seasonality ratio is well above its median), underwent a considerable change in the recent 12 months, highlighting observed discrepancies between the predicted and the actual monthly search interest. The months of September-October, which were low months in each year during the entire History period, are observed as peak months in the Forecast period (September-October 2008). This is an example where the prediction model could not anticipate unexpected exogenous events. The category of Energy and Utility showed the most irregular search behavior (with the lowest Seasonality Ratio and the highest Deviation Ratio among the first 9 categories).
In addition to the low regularity of its history, it seems that this category has also undergone a change in the dynamics of search interest, probably since mid-2008. These contributed to the poor prediction results for this category. Another good example of lack of Predictability w.r.t. the prediction model is the last plot, of the Social Networks & Online Communities category, which has shown considerable exponential growth in the Forecast period (due to the growing popularity of social networks like Facebook and Twitter) that could not be captured by the prediction model (notice the high deviation ratio). We will show below several other examples of the relation between the prediction performance and external market events.

Next, we show several examples where one can use the (posterior) prediction results in order to explore the changing dynamics of users' search behavior and possibly get insights on the relevant markets. Whenever we observe substantial prediction errors, i.e., discrepancies between the actual values and the predicted values, we can conclude that the regularities in the time series (e.g., seasonality and trend) which were captured by the prediction model were disturbed in the Forecast period. In cases where the actual values show a regularity that is not in accordance with the history's regularity, one could investigate the reasons for such a deviation in relation to known external factors. It is important to emphasize that users' search interest is not necessarily always related to consumer preferences, buying intentions, etc., and can sometimes be related to news or other associated events. A full discussion of the background and reasons for the following market observations is beyond the scope of this paper.

Example: The Automotive Industry. We can see that the forecast for the entire Vehicle brands category for the 12-month period between Aug-08 and Jul-09 (see footnote 4) shows a relatively low prediction error rate of -2.3% on average.
However, as we show below, there are some noticeable deviations in different sub-categories. We can see in the next 4 plots that the category Vehicle Shopping shows an average negative deviation of 6% from the prediction model in the last 12 months, and that the category Auto Financing shows a small negative deviation with an average of 2.3%. Notice that the categories Vehicle Maintenance and Auto Parts show positive average deviations of 4.3% and 5.2% respectively, compared to the predictions.

Footnote 4: The time series data was pulled during July 2009, thus the value for this last month is partial and might be biased.

Example: US Unemployment. Choi and Varian (2009b) have used weekly time series of the I4S aggregated categories Welfare & Unemployment and Jobs to help in short-term prediction of the "Initial Jobless Claims" reports which are issued by the US Department of Labor. In the following plots, we show that the search interest in the category Welfare & Unemployment has risen substantially above the forecast of the prediction model. The deviation of Welfare & Unemployment is systematic and relatively large. While the average MAPE for the entire set of (aggregated) categories' query shares is 8.1%, with STD 8.2%, the MAPE for Welfare & Unemployment is 31.2%, which is 2.8 standard deviations above the overall average MAPE. The actual monthly values for the aggregated query share of the category Jobs are also all higher than forecasted by the model. The time series shows a seasonal pattern with a distinguishable low value in December each year and a relatively constant level in between. At the end of the History period and throughout the Forecast period, this regularity is shifted upwards by a confounding volatile factor, which causes large positive prediction errors.
The Average Error is almost 9% per month.

We present here also the aggregated query share of the category Recruitment & Staffing, for which we can observe a corresponding negative deviation, where the model expectations are larger than the actual search interest values. Interestingly, despite a seasonal pattern similar to that of the Jobs category, it seems that the change in the users' search behavior in this category did not start until March 2009. Beforehand the predictions were rather accurate, and the average monthly deviation is therefore only about -4.8%.

Example: Mexico as Vacation Destination. In this example we show that the search interest for Mexico as a vacation destination has decreased substantially in recent months. The I4S category Mexico is a sub-category of the Vacation Destinations category (in the Travel root category), which aggregates only the vacation-related searches on Mexico. In the next plots we can see that the search interest in the category Mexico is down by almost 15% compared to the prediction. In comparison, we show the respective deviation in the entire category of Vacation Destinations, which is only -1.6% on average in the same Forecast period. Notice for reference that the search interest in another related vacation destination, the Caribbean Islands (with a similar seasonal pattern), also has not shown a deviation of similar magnitude (only -2.5%).

We considered the recent outbreak of the Swine Flu pandemic, which started to spread in April 2009, as a possible contributor to such a negative deviation of actual-vs-forecast query share for Mexico. We examined the time series of the query share for H1N1 and found it to be highly (anti-)correlated (r = -0.93) with the observed deviations for Mexico.
As a reference, we show the aggregated query share for the category Infectious Diseases, demonstrating the magnitude of the search interest in this subject (in blue), which spiked following the Swine Flu outbreak.

Example: Recession Markers. The following plots present the aggregated query share for some I4S sub-categories in subjects that might demonstrate the influence of the recent recession on the search behavior of consumers, and that often appear in articles and blog posts. The change in search interest for the category Coupons & Rebates is visible in the following plot, where one can see an average monthly deviation of 15.9% between the observed query share in the recent 12 months and the values predicted by the model. The model captured the general seasonal pattern, but accounted only for a lower holiday peak and a much more moderate upward trend. Next we see the observed query share of the I4S category Restaurants, which is systematically lower than the model predictions. The time series for the aggregated search interest in this category does not show a seasonal pattern; however, there exists an upward trend since 2004, which was apparently broken in September 2008, hence causing a negative actual-vs-forecast deviation with an average of -7.8% per month.

Below we can see for reference that the Cooking & Recipes category has a systematic positive deviation of actual-vs-forecast query share. The average monthly deviation of 6.15% represents a higher observed search interest in this category for the entire Forecast period compared to the model prediction, with an almost constant deviation since January 2009. Another example is the category Gifts, for which the query share has decreased in the recent 12 months compared to the model predictions, by 11% per month on average. Below we can also see that the category Luxury Goods is showing a negative deviation in the actual-vs-forecast query share, of 5.8% per month on average.

7.
Conclusions

We studied the predictability of search trends. We found that over half of the most popular Google search queries are predictable w.r.t. the method we have selected, and that several search categories were considerably more predictable than others; that the aggregated queries of the different categories are more predictable than the individual queries; and that almost 90% of I4S categories have predictable query shares. In particular, we showed that queries with seasonal time series and lower levels of outliers are more predictable. We considered forecasting as a baseline for identifying deviations of actual-vs-forecast, and considered some concrete examples from the automotive, travel and labor verticals. Further research can include an improved implementation of the prediction model as well as incorporating other forecasting models. We would also like to examine short-term forecasting at finer time granularity. Further analysis of actual-vs-forecast (including confidence estimation) could be conducted in various domains, like market analysis, economy, health, etc. In conjunction with this study, a basic forecasting capability was introduced into Google Insights for Search, which provides forecasting for trends that are identified as predictable. Researchers, marketers, journalists, and others can use I4S to get a wide picture of search trends, which now also includes predictability of single queries and aggregated categories in any area of interest.

Acknowledgments

We would like to thank Yannai Gonczarowsky for designing and implementing the forecasting capabilities in I4S, as well as Nir Andelman, Yuval Netzer and Amit Weinstein for creating the forecasting model library. We thank Hal Varian for his helpful comments. Special thanks to the entire team of Google Insights for Search that made this research possible.

References

[Askitas and Zimmerman 2009] Nikos Askitas and Klaus F. Zimmermann.
Google econometrics and unemployment forecasting. Applied Economics Quarterly, 55:107-120, 2009. URL http://ftp.iza.org/dp4201.pdf

[Choi and Varian 2009] Hyunyoung Choi and Hal Varian. Predicting the present with Google Trends. Technical report, Google, 2009. URL http://google.com/googleblogs/pdfs/google_predicting_the_present.pdf

[Choi and Varian 2009b] Hyunyoung Choi and Hal Varian. Predicting Initial Claims for Unemployment Insurance Using Google Trends. Technical report, Google, 2009. URL http://research.google.com/archive/papers/initialclaimsUS.pdf

[Cleveland and Tiao 1976] W.P. Cleveland and G.C. Tiao. Decomposition of Seasonal Time Series: A Model for the Census X-11 Program. Journal of the American Statistical Association, Vol. 71, No. 355, 1976, pp. 581-587.

[Cleveland etal. 1990] R.B. Cleveland, W.S. Cleveland, J.E. McRae and Irma Terpenning. STL: A Seasonal-Trend Decomposition Procedure Based on Loess. Journal of Official Statistics, Vol. 6, No. 1, 1990, pp. 3-73.

[Ginsberg etal. 2009] Jeremy Ginsberg, Matthew H. Mohebbi, Rajan S. Patel, Lynnette Brammer, Mark S. Smolinski and Larry Brilliant. Detecting influenza epidemics using search engine query data. Nature 457, 1012-1014 (2009). URL http://www.nature.com/nature/journal/v457/n7232/full/nature07634.html

[Lytras etal. 2007] Demetra P. Lytras, Roxanne M. Felpausch and William R. Bell. Determining Seasonality: A Comparison of Diagnostics From X-12-ARIMA (presented at ICES III, June 2007).

[Suhoy 2009] Tanya Suhoy. Query indices and a 2008 downturn: Israeli data. Technical report, Bank of Israel, 2009. URL http://www.bankisrael.gov.il/deptdata/mehkar/papers/dp0906e.pdf

Building High-level Features Using Large Scale Unsupervised Learning

Quoc V. Le quocle@cs.stanford.edu
Marc’Aurelio Ranzato ranzato@google.com
Rajat Monga rajatmonga@google.com
Matthieu Devin mdevin@google.com
Kai Chen kaichen@google.com
Greg S. Corrado gcorrado@google.com
Jeff Dean jeff@google.com
Andrew Y.
Ng ang@cs.stanford.edu

Abstract

We consider the problem of building high-level, class-specific feature detectors from only unlabeled data. For example, is it possible to learn a face detector using only unlabeled images? To answer this, we train a 9-layered locally connected sparse autoencoder with pooling and local contrast normalization on a large dataset of images (the model has 1 billion connections, the dataset has 10 million 200x200 pixel images downloaded from the Internet). We train this network using model parallelism and asynchronous SGD on a cluster with 1,000 machines (16,000 cores) for three days. Contrary to what appears to be a widely-held intuition, our experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not. Control experiments show that this feature detector is robust not only to translation but also to scaling and out-of-plane rotation. We also find that the same network is sensitive to other high-level concepts such as cat faces and human bodies. Starting with these learned features, we trained our network to obtain 15.8% accuracy in recognizing 22,000 object categories from ImageNet, a leap of 70% relative improvement over the previous state-of-the-art.

Appearing in Proceedings of the 29th International Conference on Machine Learning, Edinburgh, Scotland, UK, 2012. Copyright 2012 by the author(s)/owner(s).

1. Introduction

The focus of this work is to build high-level, class-specific feature detectors from unlabeled images. For instance, we would like to understand if it is possible to build a face detector from only unlabeled images.
This approach is inspired by the neuroscientific conjecture that there exist highly class-specific neurons in the human brain, generally and informally known as “grandmother neurons.” The extent of class-specificity of neurons in the brain is an area of active investigation, but current experimental evidence suggests the possibility that some neurons in the temporal cortex are highly selective for object categories such as faces or hands (Desimone et al., 1984), and perhaps even specific people (Quiroga et al., 2005).

Contemporary computer vision methodology typically emphasizes the role of labeled data to obtain these class-specific feature detectors. For example, to build a face detector, one needs a large collection of images labeled as containing faces, often with a bounding box around the face. The need for large labeled sets poses a significant challenge for problems where labeled data are rare. Although approaches that make use of inexpensive unlabeled data are often preferred, they have not been shown to work well for building high-level features.

This work investigates the feasibility of building high-level features from only unlabeled data. A positive answer to this question will give rise to two significant results. Practically, this provides an inexpensive way to develop features from unlabeled data. But perhaps more importantly, it answers an intriguing question as to whether the specificity of the “grandmother neuron” could possibly be learned from unlabeled data. Informally, this would suggest that it is at least in principle possible that a baby learns to group faces into one class because it has seen many of them and not because it is guided by supervision or rewards.

Unsupervised feature learning and deep learning have emerged as methodologies in machine learning for building features from unlabeled data.
Using unlabeled data in the wild to learn features is the key idea behind the self-taught learning framework (Raina et al., 2007). Successful feature learning algorithms and their applications can be found in recent literature using a variety of approaches such as RBMs (Hinton et al., 2006), autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007), sparse coding (Lee et al., 2007) and K-means (Coates et al., 2011). So far, most of these algorithms have only succeeded in learning low-level features such as “edge” or “blob” detectors. Going beyond such simple features and capturing complex invariances is the topic of this work.

Recent studies observe that it is quite time intensive to train deep learning algorithms to yield state-of-the-art results (Ciresan et al., 2010). We conjecture that the long training time is partially responsible for the lack of high-level features reported in the literature. For instance, researchers typically reduce the sizes of datasets and models in order to train networks in a practical amount of time, and these reductions undermine the learning of high-level features.

We address this problem by scaling up the core components involved in training deep networks: the dataset, the model, and the computational resources. First, we use a large dataset generated by sampling random frames from random YouTube videos (see footnote 1). Our input data are 200x200 images, much larger than typical 32x32 images used in deep learning and unsupervised feature learning (Krizhevsky, 2009; Ciresan et al., 2010; Le et al., 2010; Coates et al., 2011). Our model, a deep autoencoder with pooling and local contrast normalization, is scaled to these large images by using a large computer cluster. To support parallelism on this cluster, we use the idea of local receptive fields, e.g., (Raina et al., 2009; Le et al., 2010; 2011b). This idea reduces communication costs between machines and thus allows model parallelism (parameters are distributed across machines).
Asynchronous SGD is employed to support data parallelism. The model was trained in a distributed fashion on a cluster with 1,000 machines (16,000 cores) for three days.

Experimental results using classification and visualization confirm that it is indeed possible to build high-level features from unlabeled data. In particular, using a hold-out test set consisting of faces and distractors, we discover a feature that is highly selective for faces. This result is also validated by visualization via numerical optimization. Control experiments show that the learned detector is not only invariant to translation but also to out-of-plane rotation and scaling. Similar experiments reveal the network also learns the concepts of cat faces and human bodies.

Footnote 1: This is different from the work of (Lee et al., 2009), who trained their model on images from one class.

The learned representations are also discriminative. Using the learned features, we obtain significant leaps in object recognition with ImageNet. For instance, on ImageNet with 22,000 categories, we achieved 15.8% accuracy, a relative improvement of 70% over the state-of-the-art. Note that random guessing achieves less than 0.005% accuracy for this dataset.

2. Training set construction

Our training dataset is constructed by sampling frames from 10 million YouTube videos. To avoid duplicates, each video contributes only one image to the dataset. Each example is a color image with 200x200 pixels. A subset of training images is shown in Appendix A. To check the proportion of faces in the dataset, we run an OpenCV face detector on 60x60 randomly-sampled patches from the dataset (http://opencv.willowgarage.com/wiki/). This experiment shows that patches detected as faces by the OpenCV face detector account for less than 3% of the 100,000 sampled patches.

3. Algorithm

In this section, we describe the algorithm that we use to learn features from the unlabeled training set.

3.1.
Previous work

Our work is inspired by recent successful algorithms in unsupervised feature learning and deep learning (Hinton et al., 2006; Bengio et al., 2007; Ranzato et al., 2007; Lee et al., 2007). It is strongly influenced by the work of (Olshausen & Field, 1996) on sparse coding. According to their study, sparse coding can be trained on unlabeled natural images to yield receptive fields akin to V1 simple cells (Hubel & Wiesel, 1959).

One shortcoming of early approaches such as sparse coding (Olshausen & Field, 1996) is that their architectures are shallow and typically capture low-level concepts (e.g., edge “Gabor” filters) and simple invariances. Addressing this issue is a focus of recent work in deep learning (Hinton et al., 2006; Bengio et al., 2007; Bengio & LeCun, 2007; Lee et al., 2008; 2009) which build hierarchies of feature representations. In particular, Lee et al. (2008) show that stacked sparse RBMs can model certain simple functions of the V2 area of the cortex. They also demonstrate that convolutional DBNs (Lee et al., 2009), trained on aligned images of faces, can learn a face detector. This result is interesting, but unfortunately requires a certain degree of supervision during dataset construction: their training images (i.e., Caltech 101 images) are aligned, homogeneous and belong to one selected category.

Figure 1. The architecture and parameters in one layer of our network. The overall network replicates this structure three times. For simplicity, the images are in 1D.

3.2. Architecture

Our algorithm is built upon these ideas and can be viewed as a sparse deep autoencoder with three important ingredients: local receptive fields, pooling and local contrast normalization. First, to scale the autoencoder to large images, we use a simple idea known as local receptive fields (LeCun et al., 1998; Raina et al., 2009; Lee et al., 2009; Le et al., 2010).
This biologically inspired idea proposes that each feature in the autoencoder can connect only to a small region of the lower layer. Next, to achieve invariance to local deformations, we employ local L2 pooling (Hyvärinen et al., 2009; Gregor & LeCun, 2010; Le et al., 2010) and local contrast normalization (Jarrett et al., 2009). L2 pooling, in particular, allows the learning of invariant features (Hyvärinen et al., 2009; Le et al., 2010).

Our deep autoencoder is constructed by replicating the same stage three times, where each stage is composed of local filtering, local pooling and local contrast normalization. The output of one stage is the input to the next one, and the overall model can be interpreted as a nine-layered network (see Figure 1).

The first and second sublayers are often known as filtering (or simple) and pooling (or complex) respectively. The third sublayer performs local subtractive and divisive normalization, and it is inspired by biological and computational models (Pinto et al., 2008; Lyu & Simoncelli, 2008; Jarrett et al., 2009) (see footnote 2).

As mentioned above, central to our approach is the use of local connectivity between neurons. In our experiments, the first sublayer has receptive fields of 18x18 pixels and the second sublayer pools over 5x5 overlapping neighborhoods of features (i.e., pooling size). The neurons in the first sublayer connect to pixels in all input channels (or maps), whereas the neurons in the second sublayer connect to pixels of only one channel (or map) (see footnote 3). While the first sublayer outputs linear filter responses, the pooling layer outputs the square root of the sum of the squares of its inputs, and therefore it is known as L2 pooling.

Our style of stacking a series of uniform modules, switching between selectivity and tolerance layers, is reminiscent of Neocognitron and HMAX (Fukushima & Miyake, 1982; LeCun et al., 1998; Riesenhuber & Poggio, 1999). It has also been argued to be an architecture employed by the brain (DiCarlo et al., 2012).
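The L2 pooling operation described above (the square root of the sum of squares over a local neighborhood) can be illustrated with a minimal NumPy sketch. It is written in 1D for simplicity, uses uniform pooling weights, and the function name and `stride` parameter are illustrative assumptions rather than part of the described system.

```python
import numpy as np


def l2_pool_1d(h, pool_size, stride=1):
    """L2 pooling over sliding 1D windows: sqrt(sum of squares) per window.

    Windows overlap whenever stride < pool_size. Uniform weights are used,
    which is an illustrative simplification.
    """
    h = np.asarray(h, dtype=float)
    if h.size < pool_size:
        raise ValueError("input is shorter than the pooling window")
    starts = range(0, h.size - pool_size + 1, stride)
    return np.array([np.sqrt(np.sum(h[s:s + pool_size] ** 2)) for s in starts])
```

Because the inputs are squared before they are summed, the output does not change when a filter response flips sign, which is one simple kind of tolerance that pooling provides: `l2_pool_1d([3, 4, 0, 0], 2)` and `l2_pool_1d([-3, 4, 0, 0], 2)` give the same result.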
Although we use local receptive fields, they are not convolutional: the parameters are not shared across different locations in the image. This is a stark difference between our approach and previous work (LeCun et al., 1998; Jarrett et al., 2009; Lee et al., 2009). In addition to being more biologically plausible, unshared weights allow the learning of more invariances other than translational invariances (Le et al., 2010).

In terms of scale, our network is perhaps one of the largest known networks to date. It has 1 billion trainable parameters, which is more than an order of magnitude larger than other large networks reported in literature, e.g., (Ciresan et al., 2010; Sermanet & LeCun, 2011) with around 10 million parameters. It is worth noting that our network is still tiny compared to the human visual cortex, which is $10^6$ times larger in terms of the number of neurons and synapses (Pakkenberg et al., 2003).

Footnote 2: The subtractive normalization removes the weighted average of neighboring neurons from the current neuron, $g_{i,j,k} = h_{i,j,k} - \sum_{iuv} G_{uv} h_{i,j+u,i+v}$. The divisive normalization computes $y_{i,j,k} = g_{i,j,k} / \max\{c, (\sum_{iuv} G_{uv} g^2_{i,j+u,i+v})^{0.5}\}$, where $c$ is set to be a small number, 0.01, to prevent numerical errors. $G$ is a Gaussian weighting window. (Jarrett et al., 2009)

Footnote 3: For more details regarding connectivity patterns and parameter sensitivity, see Appendix B and E.

3.3. Learning and Optimization

Learning: During learning, the parameters of the second sublayers ($H$) are fixed to uniform weights, whereas the encoding weights $W_1$ and decoding weights $W_2$ of the first sublayers are adjusted using the following optimization problem

$$\underset{W_1, W_2}{\text{minimize}} \quad \sum_{i=1}^{m} \left( \left\| W_2 W_1^T x^{(i)} - x^{(i)} \right\|_2^2 + \lambda \sum_{j=1}^{k} \sqrt{\epsilon + H_j \left( W_1^T x^{(i)} \right)^2} \right). \qquad (1)$$

Here, $\lambda$ is a tradeoff parameter between sparsity and reconstruction; $m$, $k$ are the number of examples and pooling units in a layer respectively; $H_j$ is the vector of weights of the $j$-th pooling unit. In our experiments, we set $\lambda = 0.1$. This optimization problem is also known as reconstruction Topographic Independent Component Analysis (Hyvärinen et al., 2009; Le et al., 2011a) (see footnote 4). The first term in the objective ensures the representations encode important information about the data, i.e., they can reconstruct input data; whereas the second term encourages pooling features to group similar features together to achieve invariances.

Optimization: All parameters in our model were trained jointly with the objective being the sum of the objectives of the three layers. To train the model, we implemented model parallelism by distributing the local weights $W_1$, $W_2$ and $H$ to different machines. A single instance of the model partitions the neurons and weights out across 169 machines (where each machine had 16 CPU cores). A set of machines that collectively make up a single copy of the model is referred to as a “model replica.” We have built a software framework called DistBelief that manages all the necessary communication between the different machines within a model replica, so that users of the framework merely need to write the desired upwards and downwards computation functions for the neurons in the model, and don’t have to deal with the low-level communication of data across machines.

We further scaled up the training by implementing asynchronous SGD using multiple replicas of the core model. For the experiments described here, we divided the training into 5 portions and ran a copy of the model on each of these portions.
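As a concrete reading of the layer objective in Equation (1), the following NumPy sketch evaluates it for a small dense problem. It is an illustration only: it uses dense matrices instead of local receptive fields, the function name and array shapes are assumptions, and the value of `eps` is a placeholder because no value for the constant under the square root is stated here.

```python
import numpy as np


def layer_objective(X, W1, W2, H, lam=0.1, eps=1e-8):
    """Evaluate the reconstruction + pooled-sparsity objective of Equation (1).

    X  : (n_inputs, m)          one example per column
    W1 : (n_inputs, n_features) encoding weights
    W2 : (n_inputs, n_features) decoding weights
    H  : (k, n_features)        fixed, non-negative pooling weights
    lam defaults to 0.1 as in the text; eps is an assumed placeholder value.
    """
    Z = W1.T @ X                                   # filter responses W1^T x
    reconstruction = np.sum((W2 @ Z - X) ** 2)     # squared L2 error, all examples
    pooled = np.sqrt(eps + H @ (Z ** 2))           # one value per pooling unit and example
    return reconstruction + lam * np.sum(pooled)
```

With identity encoding and decoding weights the reconstruction term vanishes and only the pooled sparsity penalty remains, which makes the trade-off controlled by `lam` easy to inspect on toy inputs.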
The models communicate updates through a set of centralized “parameter servers,” which keep the current state of all parameters for the model in a set of partitioned servers (we used 256 parameter server partitions for training the model described in this paper). In the simplest implementation, before processing each mini-batch a model replica asks the centralized parameter servers for an updated copy of its model parameters. It then processes a mini-batch to compute a parameter gradient, and sends the parameter gradients to the appropriate parameter servers, which then apply each gradient to the current value of the model parameter. We can reduce the communication overhead by having each model replica request updated parameters every P steps and by sending updated gradient values to the parameter servers every G steps (where G might not be equal to P). Our DistBelief software framework automatically manages the transfer of parameters and gradients between the model partitions and the parameter servers, freeing implementors of the layer functions from having to deal with these issues.

Footnote 4: In (Bengio et al., 2007; Le et al., 2011a), the encoding weights and the decoding weights are tied: $W_1 = W_2$. However, for better parallelism and better features, our implementation does not enforce tied weights.

Asynchronous SGD is more robust to failure and slowness than standard (synchronous) SGD. Specifically, for synchronous SGD, if one of the machines is slow, the entire training process is delayed; whereas for asynchronous SGD, if one machine is slow, only one copy of SGD is delayed while the rest of the optimization can still proceed.

In our training, at every step of SGD, the gradient is computed on a minibatch of 100 examples. We trained the network on a cluster with 1,000 machines for three days. See Appendix B, C, and D for more details regarding our implementation of the optimization.

4.
Experiments on Faces

In this section, we describe our analysis of the learned representations in recognizing faces (“the face detector”) and present control experiments to understand invariance properties of the face detector. Results for other concepts are presented in the next section.

4.1. Test set

The test set consists of 37,000 images sampled from two datasets: the Labeled Faces in the Wild dataset (Huang et al., 2007) and the ImageNet dataset (Deng et al., 2009). There are 13,026 faces sampled from non-aligned Labeled Faces in the Wild (see footnote 5). The rest are distractor objects randomly sampled from ImageNet. These images are resized to fit the visible areas of the top neurons. Some example images are shown in Appendix A.

Footnote 5: http://vis-www.cs.umass.edu/lfw/lfw.tgz

4.2. Experimental protocols

After training, we used this test set to measure the performance of each neuron in classifying faces against distractors. For each neuron, we found its maximum and minimum activation values, then picked 20 equally spaced thresholds in between. The reported accuracy is the best classification accuracy among the 20 thresholds.

4.3. Recognition

Surprisingly, the best neuron in the network performs very well in recognizing faces, despite the fact that no supervisory signals were given during training. The best neuron in the network achieves 81.7% accuracy in detecting faces. There are 13,026 faces in the test set, so guessing all negative only achieves 64.8%. The best neuron in a one-layered network only achieves 71% accuracy, while the best linear filter, selected among 100,000 filters sampled randomly from the training set, only achieves 74%.

To understand their contribution, we removed the local contrast normalization sublayers and trained the network again. Results show that the accuracy of the best neuron drops to 78.5%. This agrees with a previous study showing the importance of local contrast normalization (Jarrett et al., 2009).
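The thresholding protocol of Section 4.2 can be sketched in a few lines of Python. This is a hedged illustration rather than the authors' code: the function name is hypothetical, the thresholds are assumed to include both the minimum and the maximum activation, and an image is assumed to be classified as a face when the activation exceeds the threshold. The last lines reproduce the all-negative baseline from the test set sizes given in the text.

```python
def best_threshold_accuracy(activations, labels, num_thresholds=20):
    """Best accuracy over equally spaced thresholds between the minimum
    and maximum activation of a single neuron.

    activations : one activation value per test image
    labels      : 1 for a face, 0 for a distractor
    """
    lo, hi = min(activations), max(activations)
    step = (hi - lo) / (num_thresholds - 1) if num_thresholds > 1 else 0.0
    best = 0.0
    for t in range(num_thresholds):
        thr = lo + t * step
        # Assumed decision rule: predict "face" when activation > threshold.
        correct = sum((a > thr) == bool(y) for a, y in zip(activations, labels))
        best = max(best, correct / len(labels))
    return best

# Toy example: two distractors with low activations, two faces with high ones.
toy_accuracy = best_threshold_accuracy([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1])

# All-negative baseline implied by the test set: 13,026 faces out of 37,000.
all_negative = (37000 - 13026) / 37000
```

With 23,974 distractors out of 37,000 images, the all-negative baseline evaluates to about 0.648, which matches the 64.8% quoted in Section 4.3.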
We visualize histograms of activation values for face images and random images in Figure 2. It can be seen that, even with exclusively unlabeled data, the neuron learns to differentiate between faces and random distractors. Specifically, when we give a face as an input image, the neuron tends to output a value larger than the threshold, 0. In contrast, if we give a random image as an input image, the neuron tends to output a value less than 0.

Figure 2. Histograms of faces (red) vs. no faces (blue). The test set is subsampled such that the ratio between faces and no faces is one.

4.4. Visualization

In this section, we present two visualization techniques to verify whether the optimal stimulus of the neuron is indeed a face. The first method is visualizing the most responsive stimuli in the test set. Since the test set is large, this method can reliably detect near-optimal stimuli of the tested neuron. The second approach is to perform numerical optimization to find the optimal stimulus (Berkes & Wiskott, 2005; Erhan et al., 2009; Le et al., 2010). In particular, we find the norm-bounded input x which maximizes the output f of the tested neuron, by solving:

$$x^* = \arg\max_x f(x; W, H), \quad \text{subject to } \|x\|_2 = 1.$$

Here, f(x; W, H) is the output of the tested neuron given learned parameters W, H and input x. In our experiments, this constrained optimization problem is solved by projected gradient descent with line search.

These visualization methods have complementary strengths and weaknesses. For instance, visualizing the most responsive stimuli may suffer from fitting to noise. On the other hand, the numerical optimization approach can be susceptible to local minima. Results, shown in Figure 3, confirm that the tested neuron indeed learns the concept of faces.

Figure 3. Top: Top 48 stimuli of the best neuron from the test set. Bottom: The optimal stimulus according to numerical constrained optimization.

4.5.
Invariance properties

We would like to assess the robustness of the face detector against common object transformations, e.g., translation, scaling and out-of-plane rotation. First, we chose a set of 10 face images and applied distortions to them, e.g., scaling and translating. For out-of-plane rotation, we used 10 images of faces rotating in 3D (“out-of-plane”) as the test set. To check the robustness of the neuron, we plot its averaged response over the small test set with respect to changes in scale, 3D rotation (Figure 4), and translation (Figure 5) (see footnote 6).

Footnote 6: Scaled, translated faces are generated by standard cubic interpolation. For 3D rotated faces, we used 10 sequences of rotated faces from The Sheffield Face Database (http://www.sheffield.ac.uk/eee/research/iel/research/face). See Appendix F for a sample sequence.

Figure 4. Scale (left) and out-of-plane (3D) rotation (right) invariance properties of the best feature.

Figure 5. Translational invariance properties of the best feature. The x-axis is in pixels.

The results show that the neuron is robust against complex and difficult-to-hard-wire invariances such as out-of-plane rotation and scaling.

Control experiments on dataset without faces: As reported above, the best neuron achieves 81.7% accuracy in classifying faces against random distractors. What if we remove all images that have faces from the training set? We performed the control experiment by running a face detector in OpenCV and removing those training images that contain at least one face. The recognition accuracy of the best neuron dropped to 72.5%, which is as low as the simple linear filters reported in Section 4.3.

5. Cat and human body detectors

Having achieved a face-sensitive neuron, we would like to understand if the network is also able to detect other high-level concepts. For instance, cats and body parts are quite common in YouTube. Did the network also learn these concepts?

To answer this question and quantify selectivity properties of the network with respect to these concepts, we constructed two datasets, one for classifying human bodies against random backgrounds and one for classifying cat faces against other random distractors. For ease of interpretation, these datasets have a positive-to-negative ratio identical to the face dataset.

The cat face images are collected from the dataset described in (Zhang et al., 2008). In this dataset, there are 10,000 positive images and 18,409 negative images (so that the positive-to-negative ratio is similar to the case of faces). The negative images are chosen randomly from the ImageNet dataset.

Negative and positive examples in our human body dataset are subsampled at random from a benchmark dataset (Keller et al., 2009). In the original dataset, each example is a pair of stereo black-and-white images. But for simplicity, we keep only the left images. In total, like in the case of human faces, we have 13,026 positive and 23,974 negative examples.

Figure 6. Visualization of the cat face neuron (left) and human body neuron (right).

We then followed the same experimental protocols as before. The results, shown in Figure 6, confirm that the network learns not only the concept of faces but also the concepts of cat faces and human bodies.

Our high-level detectors also outperform standard baselines in terms of recognition rates, achieving 74.8% and 76.7% on cat and human body respectively. In comparison, best linear filters (sampled from the training set) only achieve 67.2% and 68.1% respectively.

In Table 1, we summarize all previous numerical results comparing the best neurons against other baselines such as linear filters and random guesses.
To understand the effects of training, we also measure the performance of best neurons in the same network at random initialization.

We also compare our method against several other algorithms such as deep autoencoders (Hinton & Salakhutdinov, 2006; Bengio et al., 2007) and K-means (Coates et al., 2011). Results of these baselines are reported in the bottom of Table 1.

Table 1. Summary of numerical comparisons between our algorithm against other baselines. Top: Our algorithm vs. simple baselines. Here, the first three columns are results for methods that do not require training: random guess, random weights (of the network at initialization, without any training) and best linear filters selected from 100,000 examples sampled from the training set. The last three columns are results for methods that have training: the best neuron in the first layer, the best neuron in the highest layer after training, the best neuron in the network when the contrast normalization layers are removed. Bottom: Our algorithm vs. autoencoders and K-means.

Concept | Random guess | Same architecture with random weights | Best linear filter | Best first layer neuron | Best neuron | Best neuron without contrast normalization
Faces | 64.8% | 67.0% | 74.0% | 71.0% | 81.7% | 78.5%
Human bodies | 64.8% | 66.5% | 68.1% | 67.2% | 76.8% | 71.8%
Cats | 64.8% | 66.0% | 67.8% | 67.1% | 74.6% | 69.3%

Concept | Our network | Deep autoencoders, 3 layers | Deep autoencoders, 6 layers | K-means on 40x40 images
Faces | 81.7% | 72.3% | 70.9% | 72.5%
Human bodies | 76.7% | 71.2% | 69.8% | 69.3%
Cats | 74.8% | 67.5% | 68.3% | 68.5%

6. Object recognition with ImageNet

We applied the feature learning method to the task of recognizing objects in the ImageNet dataset (Deng et al., 2009). We started from a network that already learned features from YouTube and ImageNet images using the techniques described in this paper. We then added one-versus-all logistic classifiers on top of the highest layer of this network. This method of initializing a network by unsupervised learning is also known as “unsupervised pretraining.” During supervised learning with labeled ImageNet images, the parameters of lower layers and the logistic classifiers were both adjusted. This was done by first adjusting the logistic classifiers and then adjusting the entire network (also known as “fine-tuning”). As a control experiment, we also train a network starting with all random weights (i.e., without unsupervised pretraining: all parameters are initialized randomly and only adjusted by ImageNet labeled data).

We followed the experimental protocols specified by (Deng et al., 2010; Sanchez & Perronnin, 2011), in which the datasets are randomly split into two halves for training and validation. We report the performance on the validation set and compare against state-of-the-art baselines in Table 2. Note that the splits are not identical to previous work, but validation set performances vary only slightly across different splits.

Table 2. Summary of classification accuracies for our method and other state-of-the-art baselines on ImageNet.

Dataset version | 2009 (∼9M images, ∼10K categories) | 2011 (∼14M images, ∼22K categories)
State-of-the-art | 16.7% (Sanchez & Perronnin, 2011) | 9.3% (Weston et al., 2011)
Our method (without unsupervised pretraining) | 16.1% | 13.6%
Our method (with unsupervised pretraining) | 19.2% | 15.8%
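The relative improvements implied by Table 2 can be checked with a short calculation. The sketch below is only a worked example of the arithmetic; the helper name is hypothetical, and the accuracies are the Table 2 values for our method with unsupervised pretraining and for the previous state of the art.

```python
def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, as a fraction of the baseline."""
    return (ours - baseline) / baseline

# ImageNet 2009 (~10K categories): 19.2% vs. 16.7%
gain_10k = relative_improvement(19.2, 16.7)

# ImageNet 2011 (~22K categories): 15.8% vs. 9.3%
gain_22k = relative_improvement(15.8, 9.3)

print(f"10K categories: {100 * gain_10k:.1f}% relative improvement")
print(f"22K categories: {100 * gain_22k:.1f}% relative improvement")
```

Rounded to whole percentages, these come out at 15% and 70%.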
The results show that our method, starting from scratch (i.e., raw pixels), bests many state-of-the-art hand-engineered features. On ImageNet with 10K categories, our method yielded a 15% relative improvement over the previous best published result. On ImageNet with 22K categories, it achieved a 70% relative improvement over the highest other result of which we are aware (including unpublished results known to the authors of (Weston et al., 2011)). Note that random guessing achieves less than 0.005% accuracy for this dataset.

7. Conclusion

In this work, we simulated high-level class-specific neurons using unlabeled data. We achieved this by combining ideas from recently developed algorithms to learn invariances from unlabeled data. Our implementation scales to a cluster with thousands of machines thanks to model parallelism and asynchronous SGD.

Our work shows that it is possible to train neurons to be selective for high-level concepts using entirely unlabeled data. In our experiments, we obtained neurons that function as detectors for faces, human bodies, and cat faces by training on random frames of YouTube videos. These neurons naturally capture complex invariances such as out-of-plane and scale invariances.

The learned representations also work well for discriminative tasks. Starting from these representations, we obtain 15.8% accuracy for object recognition on ImageNet with 20,000 categories, a significant leap of 70% relative improvement over the state-of-the-art.

Acknowledgements: We thank Samy Bengio, Adam Coates, Tom Dean, Jia Deng, Mark Mao, Peter Norvig, Paul Tucker, Andrew Saxe, and Jon Shlens for helpful discussions and suggestions.

References

Bengio, Y. and LeCun, Y. Scaling learning algorithms towards AI. In Large-Scale Kernel Machines, 2007.
Bengio, Y., Lamblin, P., Popovici, D., and Larochelle, H. Greedy layerwise training of deep networks. In NIPS, 2007.
Berkes, P. and Wiskott, L. Slow feature analysis yields a rich repertoire of complex cell properties.
Journal of Vision, 2005.
Ciresan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. Deep big simple neural nets excel on handwritten digit recognition. CoRR, 2010.
Coates, A., Lee, H., and Ng, A. Y. An analysis of single-layer networks in unsupervised feature learning. In AISTATS 14, 2011.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
Deng, J., Berg, A., Li, K., and Fei-Fei, L. What does classifying more than 10,000 image categories tell us? In ECCV, 2010.
Desimone, R., Albright, T., Gross, C., and Bruce, C. Stimulus-selective properties of inferior temporal neurons in the macaque. The Journal of Neuroscience, 1984.
DiCarlo, J. J., Zoccolan, D., and Rust, N. C. How does the brain solve visual object recognition? Neuron, 2012.
Erhan, D., Bengio, Y., Courville, A., and Vincent, P. Visualizing higher-layer features of deep networks. Technical report, University of Montreal, 2009.
Fukushima, K. and Miyake, S. Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position. Pattern Recognition, 1982.
Gregor, K. and LeCun, Y. Emergence of complex-like cells in a temporal product network with local receptive fields. arXiv:1006.0448, 2010.
Hinton, G. E. and Salakhutdinov, R. R. Reducing the dimensionality of data with neural networks. Science, 2006.
Hinton, G. E., Osindero, S., and Teh, Y. W. A fast learning algorithm for deep belief nets. Neural Computation, 2006.
Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
Hubel, D. H. and Wiesel, T. N. Receptive fields of single neurons in the cat’s visual cortex. Journal of Physiology, 1959.
Hyvärinen, A., Hurri, J., and Hoyer, P. O.
Natural Image Statistics. Springer, 2009.
Jarrett, K., Kavukcuoglu, K., Ranzato, M. A., and LeCun, Y. What is the best multi-stage architecture for object recognition? In ICCV, 2009.
Keller, C., Enzweiler, M., and Gavrila, D. M. A new benchmark for stereo-based pedestrian detection. In Proc. of the IEEE Intelligent Vehicles Symposium, 2009.
Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.
Le, Q. V., Ngiam, J., Chen, Z., Chia, D., Koh, P. W., and Ng, A. Y. Tiled convolutional neural networks. In NIPS, 2010.
Le, Q. V., Karpenko, A., Ngiam, J., and Ng, A. Y. ICA with Reconstruction Cost for Efficient Overcomplete Feature Learning. In NIPS, 2011a.
Le, Q. V., Ngiam, J., Coates, A., Lahiri, A., Prochnow, B., and Ng, A. Y. On optimization methods for deep learning. In ICML, 2011b.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient based learning applied to document recognition. Proceedings of the IEEE, 1998.
Lee, H., Battle, A., Raina, R., and Ng, A. Y. Efficient sparse coding algorithms. In NIPS, 2007.
Lee, H., Ekanadham, C., and Ng, A. Y. Sparse deep belief net model for visual area V2. In NIPS, 2008.
Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations. In ICML, 2009.
Lyu, S. and Simoncelli, E. P. Nonlinear image representation using divisive normalization. In CVPR, 2008.
Olshausen, B. and Field, D. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 1996.
Pakkenberg, B., P., D., Marner, L., Bundgaard, M. J., Gundersen, H. J. G., Nyengaard, J. R., and Regeur, L. Aging and the human neocortex. Experimental Gerontology, 2003.
Pinto, N., Cox, D. D., and DiCarlo, J. J. Why is real-world visual object recognition hard? PLoS Computational Biology, 2008.
Quiroga, R. Q., Reddy, L., Kreiman, G., Koch, C., and Fried, I.
Invariant visual representation by single neurons in the human brain. Nature, 2005.
Raina, R., Battle, A., Lee, H., Packer, B., and Ng, A. Y. Self-taught learning: Transfer learning from unlabelled data. In ICML, 2007.
Raina, R., Madhavan, A., and Ng, A. Y. Large-scale deep unsupervised learning using graphics processors. In ICML, 2009.
Ranzato, M., Huang, F. J., Boureau, Y., and LeCun, Y. Unsupervised learning of invariant feature hierarchies with applications to object recognition. In CVPR, 2007.
Riesenhuber, M. and Poggio, T. Hierarchical models of object recognition in cortex. Nature Neuroscience, 1999.
Sanchez, J. and Perronnin, F. High-dimensional signature compression for large-scale image-classification. In CVPR, 2011.
Sermanet, P. and LeCun, Y. Traffic sign recognition with multiscale convolutional neural networks. In IJCNN, 2011.
Weston, J., Bengio, S., and Usunier, N. Wsabie: Scaling up to large vocabulary image annotation. In IJCAI, 2011.
Zhang, W., Sun, J., and Tang, X. Cat head detection - how to effectively exploit shape and texture features. In ECCV, 2008.

A. Training and test images

A subset of training images is shown in Figure 7. As can be seen, the positions, scales, and orientations of faces in the dataset are diverse. A subset of test images for identifying the face neuron is shown in Figure 8.

Figure 7. Thirty randomly-selected training images (shown before the whitening step).

Figure 8. Some example test set images (shown before the whitening step).

B. Models

Central to our approach in this paper is the use of locally-connected networks. In these networks, neurons only connect to a local region of the layer below. In Figure 9, we show the connectivity patterns of the neural network architecture described in the paper. The actual images in the experiments are 2D, but for simplicity, our images in the visualization are in 1D.

Figure 9.
Diagram of the network we used with more detailed connectivity patterns. Color arrows mean that weights connect to only one map. Dark arrows mean that weights connect to all maps. Pooling neurons only connect to one map whereas simple neurons and LCN neurons connect to all maps.

C. Model Parallelism

We use model parallelism to distribute the storage of parameters and gradient computations to different machines. In Figure 10, we show how the weights are divided and stored in different “partitions,” or more simply, machines (see also (Krizhevsky, 2009)).

Figure 10. Model parallelism with the network architecture in use. Here, it can be seen that the weights are divided according to the locality of the image and stored on different machines. Concretely, the weights that connect to the left side of the image are stored in machine 1 (“partition 1”). The weights that connect to the central part of the image are stored in machine 2 (“partition 2”). The weights that connect to the right side of the image are stored in machine 3 (“partition 3”).

D. Further multicore parallelism

Machines in our cluster have many cores which allow further parallelism. Hence, we split these cores to perform different tasks. In our implementation, the cores are divided into three groups: reading data, sending (or writing) data, and performing arithmetic computations. At every time instance, these groups work in parallel to load data, compute numerical results, and send data to the network or write data to disks.

E. Parameter sensitivity

The hyper-parameters of the network are chosen to fit computational constraints and optimize the training time of our algorithm. These parameters can be changed at the expense of longer training time or more computational resources. For instance, one could increase the size of the receptive fields at the expense of using more memory, more computation, and more network bandwidth per machine; or one could increase the number of maps at the expense of using more machines and more memory.

These hyper-parameters could also affect the performance of the features. We performed control experiments to understand the effects of two hyper-parameters: the size of the receptive fields and the number of maps. By varying each of these parameters and observing the test set accuracies, we can gain an understanding of how much they affect the performance on the face recognition task. Results, shown in Figure 11, confirm that the results are only slightly sensitive to changes in these control parameters.

Figure 11. Left: effects of receptive field sizes on the test set accuracy. Right: effects of number of maps on the test set accuracy.

F. Example out-of-plane rotated face sequence

In Figure 12, we show an example sequence of 3D (out-of-plane) rotated faces. Note that the faces are black and white but treated as a color picture in the test. More details are available at the webpage for The Sheffield Face Database dataset: http://www.sheffield.ac.uk/eee/research/iel/research/face

Figure 12. A sequence of 3D (out-of-plane) rotated faces of one individual. The dataset consists of 10 sequences.

G. Best linear filters

In the paper, we performed control experiments to compare our features against “best linear filters.” This baseline works as follows. The first step is to sample 100,000 random patches (or filters) from the training set (each patch has the size of a test set image). Then, for each patch, we compute the cosine distances between itself and the test set images. The cosine distances are treated as the feature values. Using these feature values, we then search among 20 thresholds to find the best accuracy of a patch in classifying faces against distractors. Each patch gives one accuracy for our test set.
The reported accuracy is the best accuracy among 100,000 patches randomly selected from the training set.

H. Histograms on the entire test set

Here, we also show the detailed histograms for the neurons on the entire test sets. The fact that the histograms are distinctive for positive and negative images suggests that the network has learned the concept detectors.

Figure 13. Histograms of neuron’s activation values for the best face neuron on the test set. Red: the histogram for face images. Blue: the histogram for random distractors.

Figure 14. Histograms for the best human body neuron on the test set. Red: the histogram for human body images. Blue: the histogram for random distractors.

Figure 15. Histograms for the best cat neuron on the test set. Red: the histogram for cat images. Blue: the histogram for random distractors.

I. Most responsive stimuli for cats and human bodies

In Figure 16, we show the most responsive stimuli for cat and human body neurons on the test sets. Note that the top stimuli for the human body neuron are black and white images because the test set images are black and white (Keller et al., 2009).

Figure 16. Top: most responsive stimuli on the test set for the cat neuron. Bottom: most responsive human body stimuli on the test set for the human body neuron.

J. Implementation details for autoencoders and K-means

In our implementation, deep autoencoders are also locally connected and use a sigmoidal activation function. For K-means, we downsample images to 40x40 in order to lower computational costs. We also varied the parameters of autoencoders and K-means and chose them to maximize performance given resource constraints. In our experiments, we used 30,000 centroids for K-means. These models also employed parallelism in a fashion similar to that described in the paper. They also used 1,000 machines for three days.
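The “best linear filters” baseline of Appendix G can be sketched as follows. This is a hedged illustration, not the authors' code: the function names are hypothetical, images and patches are assumed to be flattened into equal-length vectors, cosine similarity is used as the feature value (the text speaks of cosine distances), and the thresholds are assumed to be equally spaced between the minimum and maximum feature value, as in Section 4.2.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def threshold_accuracy(values, labels, num_thresholds=20):
    """Best accuracy over equally spaced thresholds (assumed protocol)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (num_thresholds - 1) if num_thresholds > 1 else 0.0
    best = 0.0
    for t in range(num_thresholds):
        thr = lo + t * step
        correct = sum((v > thr) == bool(y) for v, y in zip(values, labels))
        best = max(best, correct / len(labels))
    return best

def best_linear_filter_accuracy(patches, images, labels, num_thresholds=20):
    """Each patch acts as a linear filter; report the best patch's accuracy."""
    best = 0.0
    for patch in patches:
        features = [cosine(patch, img) for img in images]
        best = max(best, threshold_accuracy(features, labels, num_thresholds))
    return best

# Toy example with 2-pixel "images": the first patch separates the classes.
toy_patches = [[1.0, 0.0], [0.0, 1.0]]
toy_images = [[1.0, 0.1], [0.9, 0.0], [0.0, 1.0], [0.1, 1.0]]
toy_labels = [1, 1, 0, 0]
toy_best = best_linear_filter_accuracy(toy_patches, toy_images, toy_labels)
```

In the paper the search runs over 100,000 sampled patches; the toy example uses two only to keep the sketch small.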
On-Demand Language Model Interpolation for Mobile Speech Input Brandon Ballinger1, Cyril Allauzen2, Alexander Gruenstein1, Johan Schalkwyk2 1Google, 1600 Amphitheatre Parkway, Mountain View, CA 94043, USA 2Google, 76 Ninth Avenue, New York, NY 10011, USA brandonb@google.com, allauzen@google.com, alexgru@google.com, johans@google.com Abstract Google offers several speech features on the Android mobile operating system: search by voice, voice input to any text field, and an API for application developers. As a result, our speech recognition service must support a wide range of usage scenarios and speaking styles: relatively short search queries, addresses, business names, dictated SMS and e-mail messages, and a long tail of spoken input to any of the applications users may install. We present a method of on-demand language model interpolation in which contextual information about each utterance determines interpolation weights among a number of n-gram language models. On-demand interpolation results in an 11.2% relative reduction in WER compared to using a single language model to handle all traffic. Index Terms: language modeling, interpolation, mobile 1. Introduction Entering text on mobile devices is often slow and error-prone in comparison to typing on a full-sized keyboard. Google offers several features on Android aimed at making speech a viable alternative input method: search by voice, voice input into any text field, and a speech API for application developers. To search by voice, users simply tap a microphone icon on the desktop search box, or hold down the physical search button. They can speak any query, and are then shown the Google search results. To use the Voice Input feature, users tap the microphone key on the on-screen keyboard, and then speak to enter text virtually anywhere they would normally type. Users may dictate e-mail and SMS messages, fill in forms on web pages, or enter text into any application. 
Finally, the Android Speech API is a simple way for developers to integrate speech recognition capabilities into their own applications.

While a large portion of usage of the speech recognition service is comprised of spoken queries and dictation of SMS messages, there is a long tail of usage from thousands of other applications. Due to this diversity, choosing an appropriate language model for each utterance (recorded audio) is challenging. Two viable options are to build a single language model to handle all traffic, or to train a language model appropriate to each major use case and then choose the “best” one for each utterance, depending on the context of that utterance. We develop and compare a third option in this paper, in which a development set of utterances from each context is used to optimize interpolation weights among a small number of component language models. Since there may be thousands of such “contexts”, the language models are interpolated on-demand, either during decoding or as a post-processing rescoring phase. On-demand interpolation is performed efficiently via the use of a “compact interpolated” finite state transducer (FST), in which transition weights are dynamically computed.

Feature | Percent of utterances
Voice input | 49%
Search by Voice | 44%
Speech API | 7%

Table 1: Breakdown of speech traffic on Android devices that support Voice Input, Search by Voice, and Speech API.

2. Related Work

The technique of creating interpolated language models for different contexts has been used with success in a number of conversational interfaces [1, 2, 3]. In this case, the pertinent context is the system’s “dialogue state”, and it’s typical to group transcribed utterances by dialogue state and build one language model per state. Typically, states with little data are merged, and the state-specific language models are interpolated, or otherwise merged.
Language models corresponding to multiple states may also be interpolated, to share information across similar states.

The technique we develop here differs in two key respects. First, we derive interpolation weights for thousands of recognition contexts, rather than a handful of dialogue states. This makes it impractical to create each interpolated language model offline and swap in the desired one at runtime. Our language models are large, and we only learn the recognition context for a particular utterance when the audio starts to arrive. Second, rather than relying on transcribed utterances from each recognition context to train state-specific language models, we instead interpolate a small number of language models trained from large corpora.

3. Android Speech Usage Analysis

The challenge of supporting a variety of use cases is illustrated by examining the usage of the speech features available on Android. Table 1 breaks down the portion of utterances from the Android platform associated with the three speech features: voice input, search by voice, and the speech API. We note that this distinction isn’t perfect, as some users might, for example, speak a search query into a text box in the browser using the voice input feature. In addition, a large majority of the speech API utterances come from built-in Google applications; Google Maps provides a popular voice-enabled search box, for example. Overall, we observe roughly an even split between searching and dictation.

The voice input feature encourages a wide range of usage. Since its launch in January 2010, users have dictated text into over 8,000 distinct text fields. Table 2 shows the 10 most popular text fields. SMS is extremely popular, with usage levels an order of magnitude greater than any other application.
Moreover, among the top 10 fields, 4 of them come from either the built-in SMS application, or one of the many SMS applications available on the Android Market. Also popular are other dictation-style applications: Gmail, Email, and Google Talk. Android Market and Maps, both of which also appear in the top 10, represent different kinds of utterances: search queries. Finally, the Browser category here actually encompasses a wide range of fields: any text field on any web page.

Table 2: The 10 most popular voice input text fields and their percent usage.

Text Field                              Usage
SMS - Compose                           63.1%
An SMS app from Market - Compose         4.9%
Browser                                  4.8%
Google Talk                              4.5%
Gmail - Compose                          3.3%
Android Market - Search                  2.4%
Email - Compose                          1.8%
SMS - To                                 1.3%
Maps - Directions Endpoint               1.0%
An SMS app from Market - Compose         1.0%

[Figure 1: Cumulative usage for the most popular 100 text fields, rank ordered by usage. Horizontal axis: number of fields, sorted by usage. Vertical axis: cumulative percent of utterances.]

Figure 1 shows the cumulative usage per text field of the 100 most popular text fields, rank ordered by usage. Although the usage is certainly concentrated among a handful of applications, there remains a significant tail. While increasing accuracy for the tail may not have a huge effect on the overall accuracy of the system, it’s important for users to have a seamless experience using voice input: users will have a difficult time discerning that voice input may work better in some text fields than others.

Copyright © 2010 ISCA. 26-30 September 2010, Makuhari, Chiba, Japan. INTERSPEECH 2010.

4. Compact Interpolated FST

In this setting, we have a relatively small set of language models that is fixed and known in advance. At recognition time, each utterance comes with a custom set of interpolation (or mixture) weights and we need to be able to efficiently compute on-demand the corresponding interpolated model.
In a backoff language model, the conditional probability of $w \in \Sigma$ given context $h \in \Sigma^*$ is recursively defined as

$$P(w \mid h) = \begin{cases} \bar{P}(w \mid h) & \text{if } hw \in S, \\ \alpha_h \, P(w \mid h') & \text{otherwise,} \end{cases}$$

where $\bar{P}$ is the adjusted maximum likelihood probability (derived from the training corpus), $S$ is the skeleton of the model, $\alpha_h$ is the backoff weight for the context $h$ and $h'$ is the longest proper suffix of $h$. The order of the model is $\max_{hw \in S} |hw|$.

Such a language model can naturally be represented by a weighted automaton over the real semiring $(\mathbb{R}, +, \times, 0, 1)$ using failure transitions [4]: the set of states is $Q = \{h \in \Sigma^* \mid \exists w \in \Sigma \text{ such that } hw \in S\}$; for each state $h$, there is a failure transition from $h$ to $h'$ labeled by $\varphi$ and with weight $\alpha_h$; and for each $hw \in S$, there is a transition from $h$ to the longest suffix of $hw$ that belongs to $Q$, labeled by $w$ and with weight $\bar{P}(w \mid h)$.

[Figure 2: Outgoing transitions from state x in (a) G1, (b) G2 and (c) I. For $\lambda = (.6, .4)^T$, $P_{I_\lambda}(a \mid x) = .6 \times .5 + .4 \times .24$.]

Given a set $G = \{G_1, \ldots, G_m\}$ of $m$ backoff language models and a vector of mixture weights $\lambda = (\lambda_1, \ldots, \lambda_m)^T$, the linear interpolation of $G$ by $\lambda$ is defined as the language model $I_\lambda$ assigning the conditional probability:

$$P_{I_\lambda}(w \mid h) = \sum_{i=1}^{m} \lambda_i \, P_{G_i}(w \mid h). \qquad (1)$$

Using (1) directly to perform on-demand interpolation would be inefficient because for a given pair $(w, h)$ we might need to back off several times in several of the models, and this can become rather expensive when using the automata representation. Instead, we chose to reformulate the interpolated model as a backoff model:

$$P_{I_\lambda}(w \mid h) = \begin{cases} \lambda^T p_{hw} & \text{if } hw \in S(G), \\ f(\lambda, \alpha_h) \, P_{I_\lambda}(w \mid h') & \text{otherwise,} \end{cases}$$

where $p_{hw} = (P_{G_1}(w \mid h), \ldots, P_{G_m}(w \mid h))^T$, $S(G) = \cup_{i=1}^{m} S(G_i)$ and $\alpha_h = (\alpha_h(G_1), \ldots, \alpha_h(G_m))^T$. There exists a closed-form expression of $f(\lambda, \alpha)$ that ensures the proper normalization of the model.
However, in practice we decided to approximate it by the dot product of $\lambda$ and $\alpha_h$: $f(\lambda, \alpha_h) = \lambda^T \alpha_h$. The benefit of this formulation is that it perfectly fits our requirement. Since the set of models is known in advance, we can precompute $S(G)$ and all the relevant vectors ($p_{hw}$ and $\alpha_h$), effectively building a generic interpolated model $I$ as a model over $\mathbb{R}^m$. Given a new utterance and a corresponding vector of mixture weights $\lambda$, we can obtain the relevant interpolated model $I_\lambda$ by taking the dot product of each component vector of $I$ with $\lambda$.

Moreover, this approach also allows for an efficient representation of $I$ as a weighted automaton over the semiring $(\mathbb{R}^m, +, \circ, 0, 1)$ ($\circ$ denotes componentwise multiplication), the weight of each transition in the automaton being a vector in $\mathbb{R}^m$. The set of states is $Q = \{h \in \Sigma^* \mid \exists w \in \Sigma \text{ such that } hw \in S(G)\}$. For each state $h$, there is a failure transition from $h$ to $h'$ labeled by $\varphi$ and with weight $\alpha_h$, and for each $hw \in S(G)$, there is a transition from $h$ to the longest suffix of $hw$ that belongs to $Q$, labeled by $w$ and with weight $p_{hw}$. Figure 2 illustrates this construction. Given a new utterance and a corresponding vector of mixture weights $\lambda$, this automaton can be converted on-demand into a weighted automaton over the real semiring by taking the dot product of $\lambda$ and the weight vector of each visited transition.

ReFr: An Open-Source Reranker Framework

Daniel M. Bikel, Keith B. Hall
Google Research, New York, NY
{dbikel,kbhall}@google.com

Abstract

ReFr (http://refr.googlecode.com) is a software architecture for specifying, training and using reranking models, which take the n-best output of some existing system and produce new scores for each of the n hypotheses that potentially induce a different ranking, ideally yielding better results than the original system. The Reranker Framework has some special support for building discriminative language models, but can be applied to any reranking problem.
The framework is designed with parallelism and scalability in mind, being able to run on any Hadoop cluster out of the box. While extremely efficient, ReFr is also quite flexible, allowing researchers to explore a wide variety of features and learning methods. ReFr has been used for building state-of-the-art discriminative LMs for both speech recognition and machine translation systems.

Index Terms: language modeling, discriminative language modeling, reranking, structured prediction

1. Introduction

Creating effective software tools for research is a tricky business. The classic tension between flexibility and efficiency arises with greater urgency. We want researchers to be able to try out many different ideas easily, but we also want them to be able to have a quick code-test-evaluate cycle. ReFr grew out of the 2011 Johns Hopkins Summer Workshop, from the team using automatically generated confusions to synthesize training data for discriminative language models for speech and machine translation, led by Prof. Brian Roark of OHSU. That approach required tools that would scale up to training data sizes orders of magnitude larger than had previously been used to build discriminative language models, so we not only needed our training and inference to be inherently fast, but we needed to design tools with distributed computing in mind from the outset. This paper describes the tools we have developed to solve not only the immediate research problem of exploring confusions for discriminative language modeling, but also the more general problem of reranking approaches to speech and language processing, including structured prediction.
We designed ReFr to have the following properties:

• “library quality” code
• industrial strength
• academic flexibility
• easy exploration of different types of features, different update methods (e.g., MIRA-style, direct loss minimization, loss-sensitive) and different learning methods (e.g., perceptron-style, log-linear, kernel methods)
• modern, object-oriented design, complete with dynamic factories and dynamic composition for flexibility
• parallelizable, especially for distributed-computing environments

2. Data Format for I/O

There are two main choices when building discriminative reranking models for speech or machine translation: (a) rescore a lattice or hypergraph or (b) simply use a strict reranking approach applied to n-best lists. For ReFr, early on we decided to use (b), reranking n-best lists. The primary reason was the flexibility this would allow us in designing features and tools. N-best lists readily allow for sentence-level features in a way that, say, lattices do not. Additionally, it is far easier to define generic schemes for passing around n-best lists than it is to design schemes that take speech lattices as well as machine translation hypergraphs or other, problem-specific data types.

ReFr is meant to be flexible enough to allow for a variety of data sources. In order to avoid the need for overly complex data formats, we have chosen to adopt a formalism which allows one to augment the input format, allowing for flexible feature extraction and data manipulation/analysis. We opted to use a data format which mirrors the data structures that are used internally for training. Google protocol buffers [1] provide a programming-language-independent specification framework to define data formats. The protocol buffers specification language is used by the protocol buffer tools to generate source code for serializing and deserializing the data stored in the format.
Code is generated to allow for native programming-language encapsulation of the data. For example, in C++ each item of data is stored in an object based on an object-oriented data specification (a C++ class), allowing for access to the data [footnote 1].

Footnote 1: For the 2011 Johns Hopkins Workshop, we were targeting multiple tasks (ASR and MT), and so our toolkit provides a means to convert from two types of text-based n-best formats, one the output of an ASR system, the other the output of an MT system. These conversion tools are not only useful in their own right, but serve as example implementations for any developer converting from their own, proprietary format to the Google Protocol Buffer format used by ReFr.

Copyright © 2013 ISCA. 25-29 August 2013, Lyon, France. INTERSPEECH 2013: Show & Tell Contribution.

3. Core learning framework

Consider Algorithm 1, which describes the training procedure for a generic online-learning algorithm. Each training example $e_i$ comprises a set of candidate hypotheses, each of which is projected via some function $\Phi$ into a feature space, $\mathbb{R}^F$. We typically think of $\Phi$ as being a suite of feature functions, one per dimension. The model itself is defined as a weight vector in this space, $w$. Decoding, or inference, is carried out simply by taking the dot product of the model and a test instance. More generally, any kernel function $K$ may be used. The training procedure iterates over the training data $T$ (each iteration is called an epoch) until the NEEDTOKEEPTRAINING() predicate returns false. Often, such a predicate is based on the average loss of the current model on some held-out development data $D$, which is the purpose of the EVALUATE(D) line in the TRAIN(T) procedure.

Algorithm 1: Training algorithm for online-learning reranking models. Let $e_i = \{c_1, \ldots, c_k\}$ be a training example, where each $c_j$ is a candidate hypothesis. Similarly, let $d_i = \{c_1, \ldots, c_k\}$ be a held-out development data example, also consisting of $k$ candidate hypotheses. Finally, let $K$ be a kernel function.

    procedure TRAIN(T = {e_1, ..., e_n}, D = {d_1, ..., d_m})
        while NEEDTOKEEPTRAINING() do
            TRAINONEEPOCH(T)
            EVALUATE(D)
        end while
    end procedure

    procedure TRAINONEEPOCH(T)
        for each training example e_i do
            SCORECANDIDATES(e_i)
            if NEEDTOUPDATE() then
                UPDATE()
            end if
        end for
    end procedure

    procedure SCORECANDIDATES(e_i)
        for each candidate hypothesis c_j ∈ e_i do
            c_j.score ← K(w_t, c_j)
        end for
    end procedure

[Figure 1: A pictorial view of how a Model wraps instances of other interfaces (Candidate Scorer, Update Predicate, Updater, ...) that specify the predicates and functions needed to carry out model training.]

For the basic perceptron, the model starts out at time step 0 as the zero vector; that is, $w_0 = \vec{0}$. The update is

$$w_{t+1} = w_t + R_t \left[ \Phi(y_{\mathrm{oracle}}(e_i)) - \Phi(\hat{y}(e_i)) \right], \qquad (1)$$

where $y_{\mathrm{oracle}}$ is a function that picks out the hypothesis towards which we want to bias our model, $\hat{y}$ is a function that picks out the candidate hypothesis we want to bias our model against and $R_t$ is a learning rate or step size. Most often, $y_{\mathrm{oracle}}$ is defined to pick the hypothesis with the lowest loss relative to some gold-standard truth, and $\hat{y}$ is defined to pick the candidate hypothesis that scores highest under the current model $w_t$. Most of the variations of this basic learning method involve finding different ways of defining $R_t$, $\Phi$, $y_{\mathrm{oracle}}$ and $\hat{y}$, along with the various procedures and predicates shown in Algorithm 1.

Therefore, we would like our Reranker Framework to make it easy for the researcher to define these various functions, as well as to specify which ones to use at run-time. ReFr defines a Model interface with virtual methods for all of the functions shown in Algorithm 1. To avoid the exponential blow-up of overriding different combinations of these methods, ReFr also employs dynamic composition.
That is, we keep the idea of a Model interface, but additionally have each Model instance wrap a set of predicate/manipulator objects, each of which itself conforms to an interface. Figure 1 shows a pictorial representation of this scheme. As we discussed above, we employ dynamic composition to avoid defining a new subclass of Model every time we wish to explore a new combination of learning method functions. To do this, ReFr includes a very lightweight and yet powerful interpreter for a language that allows for assignment statements for primitives, vectors of primitives, Factory-constructible objects and vectors of Factory-constructible objects. Figure 2 shows an example ReFr configuration file. The syntax is intentionally very similar to that of C++. This lightweight language provides a flexible mechanism by which to specify how feature extraction, training and inference shall occur.

    model_file = "my model file"; // model output file
    model = PerceptronModel(
        name("my model"),
        score_comparator(DirectLossScoreComparator()));
    exec_feature_extractor = ExecutiveFeatureExtractorImpl(
        feature_extractors({NgramFeatureExtractor(n(2)),
                            RankFeatureExtractor()}));
    training_efe = exec_feature_extractor;
    dev_efe = exec_feature_extractor;
    training_files = {"training1.gz", "training2.gz"};
    devtest_files = {"dev1.gz", "dev2.gz"};

Figure 2: An example ReFr configuration file, read by its Interpreter class.

4. Cluster-based distributed training

As Algorithm 1 shows, the basic perceptron algorithm involves “online” updating, and thus it is possible to read in each training example from file each time it is needed, only keeping the model’s parameters persistently in memory. The Reranker Framework allows both the memory-intensive way of training as well as this “streaming mode” version of training, essential for distributed learning. The structured perceptron [2] and its variants have proven to be effective in supervised, discriminative language modeling work [3].
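The basic perceptron update of Equation (1) can be sketched in a few lines of Python. This is an illustrative sketch only and not ReFr's API: the function names, the dictionary-based sparse feature vectors and the `loss` field are assumptions made for the example, the kernel is taken to be a plain dot product, and the learning rate is held constant.

```python
def dot(w, features):
    """Dot product of a sparse weight vector and a sparse feature vector."""
    return sum(w.get(name, 0.0) * value for name, value in features.items())


def train_one_epoch(w, examples, rate=1.0):
    """One pass of perceptron-style reranker training (hypothetical helper).

    Each example is a list of candidates; each candidate is a dict with
    'features' (sparse feature vector) and 'loss' (lower is better).
    """
    for candidates in examples:
        # y_oracle: the candidate with the lowest loss.
        oracle = min(candidates, key=lambda c: c["loss"])
        # y_hat: the candidate that scores highest under the current model.
        best = max(candidates, key=lambda c: dot(w, c["features"]))
        if best is not oracle:  # update only when the model prefers a non-oracle candidate
            for name, value in oracle["features"].items():
                w[name] = w.get(name, 0.0) + rate * value
            for name, value in best["features"].items():
                w[name] = w.get(name, 0.0) - rate * value
    return w


# Toy data: one example with two candidates; the second has the lower loss.
examples = [[
    {"features": {"a": 1.0}, "loss": 1.0},
    {"features": {"b": 1.0}, "loss": 0.0},
]]
w = {}                           # w_0 is the zero vector
train_one_epoch(w, examples)     # first epoch: one update
train_one_epoch(w, examples)     # second epoch: the oracle already scores highest
print(w)
```

After the first epoch the weight for the oracle's feature is raised and the weight for the competing feature is lowered, so the second epoch makes no further change.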
We have centered the development of our open-source discriminative learning toolkit around perceptron-style algorithms, which are, by definition, online learning algorithms. Identifying the optimal solution for a distributed online optimization algorithm is still an open research question. We borrow from our previous work on distributed perceptron training in [4, 5] and use the Iterative Parameter Mixtures algorithm for distributed computation. The Reranker Framework makes it easy to switch between single processor and distributed training, which uses the Hadoop implementation of MapReduce [6].

5. Demo Plan

Our demo will consist of a walk-through of all ReFr’s features, followed by a hands-on demonstration of how easy it is to implement a new class of features for the reranker based on the rank of each candidate hypothesis. We will also show how easy it is to integrate that new class of features into training and inference. We will then demonstrate the ease with which one can use the API and the interpreted configuration language to alter the training algorithm. Finally, we will demonstrate the simple way that a user can switch from single processor training to large-scale distributed training.

6. Acknowledgements

The authors would like to thank Prof. Brian Roark of Oregon Health and Science University for leading a fantastic team at the 2011 Johns Hopkins Workshop, and we would also like to thank all of our teammates, especially Prof. Izhak Shafran of OHSU and Ph.D. candidate Maider Lehr, who are actively working with and helping us improve ReFr.

7. References

[1] Google, “Protocol buffers,” http://code.google.com/apis/protocolbuffers/.
[2] M. Collins, “Discriminative training methods for hidden Markov models: Theory and experiments with perceptron algorithms,” in Proc. EMNLP, 2002, pp. 1-8.
[3] B. Roark, M. Saraçlar, and M. Collins, “Discriminative n-gram language modeling,” Computer Speech and Language, vol. 21, no. 2, pp. 373-392, 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0885230806000271
[4] R. McDonald, K. Hall, and G. Mann, “Distributed training strategies for the structured perceptron,” in HLT-NAACL, 2010.
[5] K. Hall, S. Gilpin, and G. Mann, “MapReduce/Bigtable for distributed optimization,” in NIPS Workshop on Learning on Cores, Clusters, and Clouds, 2010.
[6] J. Dean and S. Ghemawat, “MapReduce: simplified data processing on large clusters,” CACM, vol. 51:1, 2008.

Accurate and Compact Large Vocabulary Speech Recognition on Mobile Devices

Xin Lei (1), Andrew Senior (2), Alexander Gruenstein (1), Jeffrey Sorensen (2)
(1) Google Inc., Mountain View, CA USA
(2) Google Inc., New York, NY USA
{xinlei,andrewsenior,alexgru,sorenj}@google.com

Abstract

In this paper we describe the development of an accurate, small-footprint, large vocabulary speech recognizer for mobile devices. To achieve the best recognition accuracy, state-of-the-art deep neural networks (DNNs) are adopted as acoustic models. A variety of speedup techniques for DNN score computation are used to enable real-time operation on mobile devices. To reduce the memory and disk usage, on-the-fly language model (LM) rescoring is performed with a compressed n-gram LM. We were able to build an accurate and compact system that runs well below real-time on a Nexus 4 Android phone.

Index Terms: Deep neural networks, embedded speech recognition, SIMD, LM compression.

1. Introduction

Smartphones and tablets are rapidly overtaking desktop and laptop computers as people’s primary computing device. They are heavily used to access the web, read and write messages, interact on social networks, etc. This popularity comes despite the fact that it is significantly more difficult to input text on these devices, predominantly by using an on-screen keyboard. Automatic speech recognition (ASR) is a natural, and increasingly popular, alternative to typing on mobile devices.
Google offers the ability to search by voice [1] on Android, iOS, and Chrome; Apple’s iOS devices come with Siri, a conversational assistant. On both Android and iOS devices, users can also speak to fill in any text field where they can type (see, e.g., [2]), a capability heavily used to dictate SMS messages and e-mail. A major limitation of these products is that speech recognition is performed on a server. Mobile network connections are often slow or intermittent, and sometimes non-existent. Therefore, in this study, we investigate techniques to build an accurate, small-footprint speech recognition system that can run in real-time on modern mobile devices.

Previously, speech recognition on handheld computers and smartphones has been studied in the DARPA-sponsored Transtac Program, where speech-to-speech translation systems were developed on the phone [3, 4, 5]. In the Transtac systems, Gaussian mixture models (GMMs) were used as acoustic models. While the task was a small domain, with limited training data, the memory usage in the resulting systems was moderately high.

In this paper, we focus on large vocabulary on-device dictation. We show that deep neural networks (DNNs) can provide large accuracy improvements over GMM acoustic models, with a significantly smaller footprint. We also demonstrate how memory usage can be significantly reduced by performing on-the-fly rescoring with a compressed language model during decoding.

The rest of this paper is organized as follows. In Section 2, the embedded GMM acoustic model is described. Section 3 presents the training of embedded DNNs, and the techniques we employed to speed up DNN inference at runtime. Section 4 describes the compressed language models for on-the-fly rescoring. Section 5 shows the experimental results of recognition accuracy and speed on the Nexus 4 platform. Finally, Section 6 concludes the paper and discusses future work.

2. GMM Acoustic Model

Our embedded GMM acoustic model is trained on 4.2M utterances, or more than 3,000 hours of speech data containing randomly sampled anonymized voice search queries and other dictation requests on mobile devices. The acoustic features are 9 contiguous frames of 13-dimensional PLP features spliced and projected to 40 dimensions by linear discriminant analysis (LDA). Semi-tied covariances [6] are used to further diagonalize the LDA transformed features. Boosted-MMI [7] was used to train the model discriminatively.

The GMM acoustic model contains 1.3k clustered acoustic states, with a total of 100k Gaussians. To reduce model size and speed up computation on embedded platforms, the floating-point GMM model is converted to a fixed-point representation, similar to that described in [8]. Each dimension of the Gaussian mean vector is quantized into 8 bits, and each dimension of the precision vector into 16 bits. The resulting fixed-point GMM model size is about 1/3 of the floating-point model, and there is no loss of accuracy due to this conversion in our empirical testing.

3. DNNs for Embedded Recognition

We have previously described the use of deep neural networks for probability estimation in our cloud-based mobile voice recognition system [9]. We have adopted this system for developing DNN models for embedded recognition, and summarize it here. The model is a standard feed-forward neural network with $k$ hidden layers of $n_h$ nodes, each computing a nonlinear function of the weighted sum of the outputs of the previous layer. The input layer is the concatenation of $n_i$ consecutive frames of 40-dimensional log filterbank energies calculated on 25ms windows of speech every 10ms. The $n_o$ softmax outputs estimate the posterior of each acoustic state. We have experimented with conventional logistic nonlinearities and rectified linear units that have recently shown superior performance in our large scale task [10], while also reducing computation.
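As a rough check on model size, the number of parameters of such a feed-forward network follows directly from $k$, $n_h$, $n_i$ and $n_o$. The sketch below assumes fully connected layers with one bias per node, which the text does not state explicitly; the function name is hypothetical. The configuration plugged in is the one the paper reports for its embedded model ($k = 6$, $n_h = 512$, $n_i = 16$, $n_o = 2000$).

```python
def dnn_parameter_count(k, n_h, n_i, n_o, feature_dim=40):
    """Count weights and biases of a fully connected feed-forward network.

    k: number of hidden layers; n_h: nodes per hidden layer;
    n_i: number of stacked input frames; n_o: number of softmax outputs.
    Assumes one bias per node (an assumption, not stated in the paper).
    """
    layer_sizes = [n_i * feature_dim] + [n_h] * k + [n_o]
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


# Embedded configuration reported in the paper.
embedded = dnn_parameter_count(k=6, n_h=512, n_i=16, n_o=2000)
print(f"{embedded:,} parameters")  # 2,667,472, i.e. about 2.7M
```

Under these assumptions the count comes to roughly 2.7M, in line with the figure quoted for the embedded model.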
Copyright © 2013 ISCA. 25-29 August 2013, Lyon, France. INTERSPEECH 2013.

While our server-based model has 50M parameters ($k = 4$, $n_h = 2560$, $n_i = 26$ and $n_o = 7969$), to reduce the memory and computation requirement for the embedded model, we experimented with a variety of sizes and chose $k = 6$, $n_h = 512$, $n_i = 16$ and $n_o = 2000$, or 2.7M parameters. The input window is asymmetric; each additional frame of future context adds 10ms of latency to the system, so we limit ourselves to 5 future frames, and choose around 10 frames of past context, trading off accuracy and computation.

Our context dependency (CD) phone trees were initially constructed using a GMM training system that gave 14,247 states. By pruning this system using likelihood gain thresholds, we can choose an arbitrary number of CD states. We used an earlier large scale model with the full state inventory, which achieved around 14% WER, to align the training data, and then mapped the 14k states to the desired smaller inventory. Thus we use a better model to label the training data to an accuracy that cannot be achieved with the embedded scale model.

3.1. Training

Training uses conventional backpropagation of gradients from a cross entropy error criterion. We use minibatches of 200 frames with an exponentially decaying learning rate and a momentum of 0.9. We train our neural networks on a dedicated GPU based system. With all of the data available locally on this system, the neural network trainer can choose minibatches and calculate the backpropagation updates.

3.2. Decoding speedup

Mobile CPUs are designed primarily for lower power usage and do not have as many or as powerful math units as CPUs used in server or desktop applications. This makes DNN inference, which is computationally expensive, a particular challenge. We exploit a number of techniques to speed up the DNN score computation on these platforms. As described in [11], we use a fixed-point representation of DNNs.
All activations and intermediate layer weights are quantized into 8-bit signed char, and biases are encoded as 32-bit int. The input layer remains floating-point, to better accommodate the larger dynamic ranges of input features. There is no measured accuracy loss resulting from this conversion to fixed-point format.

Single Instruction Multiple Data (SIMD) instructions are used to speed up the DNN computation. With our choice of smaller-sized fixed-point integer units, the SIMD acceleration is significantly more efficient, exploiting up to 8-way parallelism in each computation. We use a combination of inline assembly to speed up the most expensive matrix multiplication functions, and compiler intrinsics in the sigmoid and rectified linear calculations.

Batched lazy computation [11] is also performed. To exploit the multiple cores present on modern smartphones, we compute the activations up to the last layer in a dedicated thread. The output posteriors of the last layer are computed only when needed by the decoder in a separate thread. Each thread computes results for a batch of frames at a time. The choice of batch size is a tradeoff between computation efficiency and recognition latency.

Finally, frame skipping [12] is adopted to further reduce computation. Activations and posteriors are computed only every $n_b$ frames and used for $n_b$ consecutive frames. In experiments we find that for $n_b = 2$, the accuracy loss is negligible; however for $n_b \geq 3$, the accuracy degrades quickly.

4. Language Model Compression

We create n-gram language models appropriate for embedded recognition by first training a 1M word vocabulary and 18M n-gram Katz-smoothed 4-gram language model using Google’s large-scale LM training infrastructure [13]. The language model is trained using a very large corpus (on the order of 20 billion words) from a variety of sources, including search queries, web documents and transcribed speech logs.
To reduce memory usage, we use two language models during decoding. First, a highly-pruned LM is used to build a small CLG transducer [14] that is traversed by the decoder. Second, we use a larger LM to perform on-the-fly lattice rescoring during search, similar to [15]. We have observed that a CLG transducer is generally two to three times larger than a standalone LM, so this rescoring technique significantly reduces the memory footprint.

Both language models used in decoding are obtained by shrinking the 1M vocabulary and 18M n-gram LM. We aggressively reduce the vocabulary to the 50K highest unigram terms. We then apply relative entropy pruning [16] as implemented in the OpenGrm toolkit [17]. The resulting finite state model for the rescoring LM has 1.4M n-grams, with just 280K states and 1.7M arcs. The LM for first pass decoding contains only unigrams and about 200 bigrams. We further reduce the memory footprint of the rescoring LM by storing it in an extremely memory-efficient manner, discussed below.

4.1. Succinct storage using LOUDS

If you consider a backoff language model’s structure, the failure arcs from (n + 1)-gram contexts to n-gram contexts and, ultimately, to the unigram state form a tree. Trees can be stored using 2 bits per node using a level-order unary degree sequence (LOUDS), where we visit the nodes breadth-first writing 1s for the number of (n + 1)-gram contexts and then terminating with a 0 bit. We build a bit sequence similarly for the degree of outbound non-φ arcs. The LOUDS data structure provides first-child, last-child, and parent navigation, so we are able to store a language model without storing any next-state values. As a contiguous, index-free data object, the language model can be easily memory mapped. The implementation of this model is part of the OpenFst library [18] and covered in detail in [19].
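The LOUDS encoding described above can be illustrated with a small Python sketch. This is not the OpenFst implementation: the function names and the toy context tree are assumptions for the example, nodes are identified by their breadth-first index starting at 0, and rank and select are written as linear scans for clarity, whereas a practical implementation would use an index over the bit sequence.

```python
from collections import deque


def louds_encode(children, root):
    """Return (bits, order): the LOUDS bit string and the breadth-first node order."""
    bits, order = [], []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")  # one 1 per child, then a terminating 0
        queue.extend(kids)
    return "".join(bits), order


def rank(bits, b, i):
    """Number of bits equal to b strictly before position i."""
    return bits[:i].count(b)


def select(bits, b, r):
    """Position of the r-th (1-based) bit equal to b."""
    pos = -1
    for _ in range(r):
        pos = bits.index(b, pos + 1)
    return pos


def first_child(bits, i):
    """Breadth-first index of the first child of node i, or None for a leaf."""
    start = 0 if i == 0 else select(bits, "0", i) + 1
    return rank(bits, "1", start) + 1 if bits[start] == "1" else None


def parent(bits, j):
    """Breadth-first index of the parent of node j, or None for the root."""
    return None if j == 0 else rank(bits, "0", select(bits, "1", j))


# Toy backoff tree: the unigram state has contexts "a" and "b" below it,
# and the bigram context "b a" backs off to "a".
children = {"<unigram>": ["a", "b"], "a": ["b a"]}
bits, order = louds_encode(children, "<unigram>")
print(bits, order)
```

With this variant, a tree of n nodes takes 2n - 1 bits, and no explicit child or parent pointers are stored: navigation is recovered from the bit sequence alone.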
The precise storage requirements, measured in bits, are

$$4 n_s + n_a + (W + L)(n_s + n_a) + W n_f + c$$

where $n_s$ is the number of states, $n_f$ the number of final states, $n_a$ is the number of arcs, $L$ is the number of bits per word-id, and $W$ is the number of bits per probability value. This is approximately one third the storage required by OpenFst’s vector representation. For the models discussed here, we use 16 bits for both labels and weights.

During run time, to support fast navigation in the language model, we build additional indexes of the LOUDS bit sequences to support the operations $\mathrm{rank}_b(i)$, the number of $b$-valued bits before index $i$, and its inverse $\mathrm{select}_b(r)$. We maintain a two level index that adds an additional $0.251(4 n_s + n_a)$ bits. Here it is important to make use of fast assembly operations such as find first set during decoding, which we do through compiler intrinsics.

4.2. Symbol table compression

The word symbol table for an LM is used to map words to unique identifiers. Symbol tables are another example of a data structure that can be represented as a tree. In this case we relied upon the implementation contained in the MARISA library [20]. This produces a symbol table that fits in just one third the space of the concatenated strings of the vocabulary, yet provides a bidirectional mapping between integers and vocabulary strings. We are able to store our vocabulary in about 126K bytes, less than 3 bytes per entry in a memory mappable image. The MARISA library assigns the string to integer ids during compression, so we relabel all of the other components in our system to match this assignment.

5. Experimental Results

To evaluate accuracy performance, we use a test set of 20,000 anonymized transcribed utterances from users speaking in order to fill in text fields on mobile devices. This biases the test set towards dictation, as opposed to voice search queries, because dictation is more useful than search when no network connection is available.
To measure speed performance, we decode a subset of 100 utterances on an Android Nexus 4 (LG) phone. The Nexus 4 is equipped with a 1.5GHz quad-core Qualcomm Snapdragon S4 Pro CPU, and 2GB of RAM. It runs the Android 4.2 operating system. To reduce start up loading time, all data files, including the acoustic model, the CLG transducer, the rescoring LM and the symbol tables, are memory mapped on the device. We use a background thread to “prefetch” the memory mapped resources when decoding starts, which mitigates the slowdown in decoding for the first several utterances.

5.1. GMM acoustic model

The GMM configuration achieves a word error rate (WER) of 20.7% on this task, with an average real-time (RT) factor of 0.63. To achieve this speed, the system uses integer arithmetic for likelihood calculation and decoding. The Mahalanobis distance computation is accelerated using fixed-point SIMD instructions. Gaussian selection is used to reduce the burden of likelihood computation, and further efficiencies come from computing likelihoods for batches of frames.

5.2. Accuracy with DNNs

We compare the accuracy of DNNs with different configurations to the baseline GMM acoustic model in Table 1. A DNN with 1.48M parameters already outperforms the GMM in accuracy, with a disk size of only 17% of the GMM’s. By increasing the number of hidden layers from 4 to 6 and number of outputs from 1000 to 2000, we obtain a large improvement of 27.5% relative in WER compared to the GMM baseline. The disk size of this DNN is 26% of the size of the GMM’s.

For comparison, we also evaluate a server-sized DNN with an order of magnitude more parameters, and it gives 12.3% WER. Note that all experiments in Table 1 use smaller LMs in decoding. In addition, with an un-pruned server LM, the server DNN achieves 9.9% WER while the server GMM achieves 13.5%. Therefore, compared to a full-size DNN server system, there is a 2.4% absolute loss due to smaller LMs, and 2.8% due to smaller DNN.
Compared to the full-size GMM server system, the embedded DNN system is about 10% relative worse in WER. The impact of frame skipping is evaluated with the DNN 6×512 model. As shown in Table 2, accuracy quickly degrades when $n_b$ is larger than 2.

Table 2: Accuracy results with frame skipping in a DNN system.

n_b        1      2      3      4      5
WER (%)    15.1   15.2   15.6   16.0   16.7

5.3. Speed benchmark

For the speed benchmark, we measure the average RT factor as well as the 90th-percentile RT factor. As shown in Table 3, the baseline GMM system with SIMD optimization gives an average RT factor of 0.63. The fixed-point DNN gives 1.32×RT without SIMD optimization, and 0.75×RT with SIMD. Batched lazy computation improves average RT by 0.06 but degrades the 90th-percentile RT performance, probably due to less efficient on-demand computation for difficult utterances. After frame skipping with $n_b = 2$, the speed of the DNN system is further improved slightly to 0.66×RT. Finally, the overhead of the compact LOUDS based LM is about 0.13×RT on average.

Table 3: Average real-time (RT) and 90th-percentile RT factors of different system settings.

System               Average RT   RT(90)
GMM                  0.63         0.90
DNN (fixed-point)    1.32         1.43
+ SIMD               0.75         0.87
+ lazy batch         0.69         1.01
+ frame skipping     0.66         0.97
+ LOUDS              0.79         1.24

5.4. System Footprint

Compared to the baseline GMM system, the new system with LM compression and DNN acoustic model achieves a much smaller footprint. The data file sizes are listed in Table 4. Note that conversion of the 34MB floating-point GMM model to a 14MB fixed-point GMM model itself provides a large reduction in size. The use of the DNN reduces the size by 10MB, and the LM compression contributes another 18MB reduction. Our final embedded DNN system size is reduced from 46MB to 17MB, while achieving a big WER reduction from 20.7% to 15.2%.

6. Conclusions

In this paper, we have described a fast, accurate and small-footprint speech recognition system for large vocabulary dictation on the device.
DNNs are used as the acoustic model, which provides a 27.5% relative WER improvement over the baseline GMM models. The use of DNNs also significantly reduces the memory usage. Various techniques are adopted to speed up the DNN inference at decoding time. In addition, a LOUDS based language model compression reduces the rescoring LM size by more than 60% relative. Overall, the size of the data files of the system is reduced from 46MB to 17MB. Future work includes speeding up rescoring using the LOUDS LM as well as further compression techniques. We also continue to investigate the accuracy performance with different sizes of LM for CLG and rescoring.

Table 1: Comparison of GMM and DNNs with different sizes. The input layer is denoted by number of filterbank energies × the context window size (left + current + right). The hidden layers are denoted by number of hidden layers × number of nodes per layer. The number of outputs is the number of HMM states in the model.

Model        WER (%)   Input Layer    Hidden Layers   # Outputs   # Parameters   Size
GMM          20.7      -              -               1314        8.08M          14MB
DNN 4×400    22.6      40×(8+1+4)     4×400           512         0.9M           1.5MB
DNN 4×480    20.3      40×(10+1+5)    4×480           1000        1.5M           2.4MB
DNN 6×512    15.1      40×(10+1+5)    6×512           2000        2.7M           3.7MB
Server DNN   12.3      40×(20+1+5)    4×2560          7969        49.3M          50.8MB

Table 4: Comparison of data file sizes (in MB) in the baseline GMM system and the DNN system, with and without LOUDS LM compression. AM denotes the acoustic model, CLG is the transducer for decoding, LM denotes the rescoring LM, and Symbols denotes the word symbol table.

System     AM     CLG    LM      Symbols   Total
GMM        14     2.7    29      0.55      46
+ LOUDS    14     2.7    10.7    0.13      27
DNN        3.7    2.8    29      0.55      36
+ LOUDS    3.7    2.8    10.7    0.13      17

7. Acknowledgements

The authors would like to thank our former colleague Patrick Nguyen for implementing the portable neural network runtime engine used in this study. Thanks also to Vincent Vanhoucke and Johan Schalkwyk for helpful discussions and support during this work.

8. References

[1] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, “Google search by voice: A case study,” in Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics, pp. 61–90. Springer, 2010.
[2] B. Ballinger, C. Allauzen, A. Gruenstein, and J. Schalkwyk, “On-demand language model interpolation for mobile speech input,” in Proc. Interspeech, 2010.
[3] J. Zheng et al., “Implementing SRI’s Pashto speech-to-speech translation system on a smart phone,” in SLT, 2010.
[4] J. Xue, X. Cui, G. Daggett, E. Marcheret, and B. Zhou, “Towards high performance LVCSR in speech-to-speech translation system on smart phones,” in Proc. Interspeech, 2012.
[5] R. Prasad et al., “BBN Transtalk: Robust multilingual two-way speech-to-speech translation for mobile platforms,” Computer Speech and Language, vol. 27, pp. 475–491, February 2013.
[6] M. J. F. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. Speech and Audio Processing, vol. 7, pp. 272–281, 1999.
[7] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, and K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” in Proc. ICASSP, 2008.
[8] E. Bocchieri, “Fixed-point arithmetic,” Automatic Speech Recognition on Mobile Devices and over Communication Networks, pp. 255–275, 2008.
[9] N. Jaitly, P. Nguyen, A. W. Senior, and V. Vanhoucke, “Application of pretrained deep neural networks to large vocabulary speech recognition,” in Proc. Interspeech, 2012.
[10] M. D. Zeiler et al., “On rectified linear units for speech processing,” in Proc. ICASSP, 2013.
[11] V. Vanhoucke, A. Senior, and M. Z. Mao, “Improving the speed of neural networks on CPUs,” in Proc. Deep Learning and Unsupervised Feature Learning NIPS Workshop, 2011.
[12] V. Vanhoucke, M. Devin, and G. Heigold, “Multiframe deep neural networks for acoustic modeling,” in Proc. ICASSP, 2013.
[13] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in EMNLP, 2007, pp. 858–867.
[14] M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” Handbook of Speech Processing, pp. 559–582, 2008.
[15] T. Hori and A. Nakamura, “Generalized fast on-the-fly composition algorithm for WFST-based speech recognition,” in Proc. Interspeech, 2005.
[16] A. Stolcke, “Entropy-based pruning of backoff language models,” in DARPA Broadcast News Transcription and Understanding Workshop, 1998, pp. 8–11.
[17] B. Roark, R. Sproat, C. Allauzen, M. Riley, J. Sorensen, and T. Tai, “The OpenGrm open-source finite-state grammar software libraries,” in Proceedings of the ACL 2012 System Demonstrations. 2012, ACL ’12, pp. 61–66, Association for Computational Linguistics.
[18] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: A general and efficient weighted finite-state transducer library,” in Proceedings of the Ninth International Conference on Implementation and Application of Automata, (CIAA 2007). 2007, vol. 4783 of Lecture Notes in Computer Science, pp. 11–23, Springer, http://www.openfst.org.
[19] J. Sorensen and C. Allauzen, “Unary data structures for language models,” in Proc. Interspeech, 2011.
[20] S. Yata, “Prefix/Patricia trie dictionary compression by nesting prefix/Patricia tries (japanese),” in Proceedings of 17th Annual Meeting of the Association for Natural Language, Toyohashi, Japan, 2011, NLP2011, https://code.google.com/p/marisa-trie/.

Backoff Inspired Features for Maximum Entropy Language Models

Fadi Biadsy, Keith Hall, Pedro Moreno and Brian Roark
Google, Inc.
{biadsy,kbhall,pedro,roark}@google.com

Abstract

Maximum Entropy (MaxEnt) language models [1, 2] are linear models that are typically regularized via well-known L1 or L2 terms in the likelihood objective, hence avoiding the need for the kinds of backoff or mixture weights used in smoothed n-gram language models using Katz backoff [3] and similar techniques.
Even though backoff cost is not required to regularize the model, we investigate the use of backoff features in MaxEnt models, as well as some backoff-inspired variants. These features are shown to improve model quality substantially, as shown in perplexity and word-error rate reductions, even in very large scale training scenarios of tens or hundreds of billions of words and hundreds of millions of features.

Index Terms: maximum entropy modeling, language modeling, n-gram models, linear models

1. Introduction

A central problem in language modeling is how to combine information from various model components, e.g., mixing models trained with differing Markov orders for smoothing or on distinct corpora for adaptation. Smoothing (regularization) for n-gram language models is typically presented as a mechanism whereby higher-order models are combined with lower-order models so as to achieve both the specificity of the higher-order model and the more robust generality of the lower-order model. Most commonly, this combination is effected via an interpolation or backoff mechanism, in which each prefix (history) of an n-gram has a parameter which dictates how much cost is associated with making use of lower-order n-gram estimates, often called the “backoff cost”. This becomes a parameter estimation problem in its own right, either through discounting or mixing parameters; and these are often estimated via extensive parameter tying, heuristics based on count histograms, or both.

Log linear models provide an alternative to n-gram backoff or interpolated models for combining evidence from multiple, overlapping sources of evidence, with very different regularization methods. Instead of defining a specific model structure with backoff costs and/or mixing parameters, these models combine features from many sources into a single linear feature vector, and score a word by taking the dot product of the feature vector with a learned parameter vector.
Learning can be via locally normalized likelihood objective functions, as in Maximum Entropy (MaxEnt) models [1, 2, 4] or global “whole sentence” objectives [5, 6, 7]. For locally normalized MaxEnt models, which estimate a conditional distribution over a vocabulary given the prefix history (just as the backoff smoothed n-gram models do), the brute-force local normalization over the vocabulary obviates the need for complex backoff schemes to avoid zero probabilities. One can simply toss in n-gram features of all the orders, and learn their relative contribution. Recall, however, that the standard backoff n-gram models do not only contain parameters associated with n-grams; they also contain parameters associated with the backoff weights for each prefix history. For every proper prefix of an n-gram in the model, there will be an associated backoff weight, which penalizes to a greater or lesser extent words that have been previously unseen following that prefix history. For some histories we should have a relatively high expectation of seeing something new, either because the history itself is rare (hence we do not have enough observations yet to be strongly predictive) or it simply predicts relatively open classes of possible words, e.g., “the”, which can precede many possible words, including many that were presumably unobserved following “the” in the training corpus. Other prefixes may be highly predictive so that the expectation of seeing something previously unobserved is relatively low, e.g., “Barack”. Granted, MaxEnt language models (LMs) do not need this information about prefix histories to estimate regularized probabilities. Chen and Rosenfeld [4] survey various smoothing and regularization methods for MaxEnt language models, including reducing the number of features (as L1 regularization does), optimizing to match expected frequencies to discounted counts, or optimizing to modified objectives, such as L2 regularization. 
In none of these methods are there parameters in the model associated with the sort of “otherwise” semantics of conventional n-gram backoffs. Because such features are not required for smoothing, they are not part of the typical feature set used in log linear language modeling, yet our results demonstrate that they should be. The ultimate usefulness of such features likely depends on the amount of training data available, and we have thus applied highly optimized MaxEnt training to very large data sets.

In large scale n-gram modeling, it has been shown that the specific details of the smoothing algorithm are typically less important than the scale. So-called “stupid backoff” [8] is an efficient, scalable estimation method that, despite lack of normalization guarantees, is shown to be extremely effective in very large data set scenarios. While this has been taken to demonstrate that the specifics of smoothing are unimportant as the data gets large, those parameters are still important components of the modeling approach, even if their usefulness is robust to variation in parameter value. We demonstrate that features patterned after backoff weights, and several related generalizations of these features, can in fact make a large difference to a MaxEnt language model, even if the amount of training data is very large.

In the next section, we present background for language modeling and cover related work. We then present our MaxEnt training approach, and the new features. Finally, we present experimental results on a range of large scale speech tasks.

2. Background and Related Work

Let $w_i$ be the word at position $i$ in the string, let $w_{i-k}^{i-1} = w_{i-k} \ldots w_{i-1}$ be the prefix history of the string prior to $w_i$, and let $\overline{P}$ be a probability estimate assigned to seen n-grams by the specific smoothing method. Then the standard backoff language model formulation is as follows:

$$P(w_i \mid w_{i-k}^{i-1}) = \begin{cases} \overline{P}(w_i \mid w_{i-k}^{i-1}) & \text{if } c(w_{i-k}^{i}) > 0 \\ \alpha(w_{i-k}^{i-1})\, P(w_i \mid w_{i-k+1}^{i-1}) & \text{otherwise} \end{cases}$$

This recursive smoothing formulation has two kinds of parameters: n-gram probabilities $\overline{P}(w_i \mid w_{i-k}^{i-1})$ and backoff weights $\alpha(w_{i-k}^{i-1})$, which are parameters associated with the prefix history $w_{i-k}^{i-1}$.

Copyright © 2014 ISCA. INTERSPEECH 2014, 14-18 September 2014, Singapore.

MaxEnt models are log linear models that score alternatives by taking the exponential of the dot product between a feature vector and a parameter vector and normalizing. Let $\Phi(w_{i-k} \ldots w_i)$ be a $d$-dimensional feature vector, $\theta$ a $d$-dimensional parameter vector, and $V$ a vocabulary. Then

$$P(w_i \mid w_{i-k}^{i-1}) = \frac{\exp(\Phi(w_{i-k} \ldots w_i) \cdot \theta)}{Z(w_{i-k} \ldots w_{i-1}, \theta)}$$

where $Z$ is a partition function (normalization constant):

$$Z(w_{i-k} \ldots w_{i-1}, \theta) = \sum_{v \in V} \exp(\Phi(w_{i-k} \ldots w_{i-1} v) \cdot \theta)$$

Training with a likelihood objective function is a convex optimization problem, with well-studied efficient estimation techniques, such as stochastic gradient descent. Regularization techniques are also well-studied, and include L1 and L2 regularization, or their combination, which are modifications of the likelihood objective to either keep parameter values as close to zero as possible (L2) or reduce the number of features with nonzero parameter weights by pushing many parameters to zero (L1). We employ a distributed approximation to L1; see Section 3.1. The most expensive part of this optimization is the calculation of the partition function, since it requires summing over the entire vocabulary, which can be very large.
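The MaxEnt probability and its partition function can be illustrated with a minimal sketch. It assumes binary indicator features and a sparse parameter vector stored as a dictionary; the names `maxent_prob` and `ngram_features` are hypothetical and are not taken from the paper, and the brute-force loop over the vocabulary is exactly the expensive normalization described above rather than any of the optimized methods.

```python
import math

def ngram_features(history, word, max_order=3):
    """Binary n-gram features: one tuple per order, each ending in `word`."""
    feats = []
    for k in range(max_order):
        if k <= len(history):
            context = tuple(history[len(history) - k:])  # last k history words
            feats.append(context + (word,))
    return feats

def maxent_prob(word, history, vocab, theta, features=ngram_features):
    """P(word | history) = exp(Phi . theta) / Z(history, theta)."""
    def score(v):
        # Dot product of a binary feature vector with theta.
        return sum(theta.get(f, 0.0) for f in features(history, v))

    scores = {v: score(v) for v in vocab}
    top = max(scores.values())  # subtract the max for numerical stability
    z = sum(math.exp(s - top) for s in scores.values())  # partition function
    return math.exp(scores[word] - top) / z
```

With an empty `theta` every word receives probability 1/|V|; the cost of each call grows linearly with the vocabulary size, which is why the partition function dominates training time.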
Efficient methods to enable training with very large corpora and large vocabularies have been investigated over the past decades, from methods to exploit structural overlap between features [9, 10] to methods for decomposing the multi-class language modeling problem into many binary language modeling problems (one versus the rest) and sampling less data to effectively learn the models [11]. For this paper, we employed many optimizations to enable training with very large vocabularies (several hundred thousand words) and very large training sets (>100B words).

3. Methods

3.1. Maximum Entropy training

Many features have been used in MaxEnt language models, including standard n-grams and trigger words [1], topic-based features [12] and morphological and sub-word based features [13, 14]. Feature engineering is a major consideration in this sort of modeling, and in Section 3.2 we detail our newly designed feature templates. Before we do so, we present the training methods that allow us to scale up to a very large vocabulary and many training instances.

In this work, we wish to scale up MaxEnt language model training to learn from the same amount of data used for standard backoff n-gram language models. We achieve this by exploiting recent work on gradient-based distributed optimization; specifically, distributed stochastic gradient descent (SGD) [15, 16, 17, 18, 19]. We differ slightly from previous work in multiple aspects: (1) we apply a final L1 regularization step at the end of each reducer using statistics collected from the mappers; (2) we estimate the gradient using a mini-batch of 16 samples, where the mini-batch is processed in parallel via multi-threading; (3) we do not perform any binarization or subsampling as in [20]; (4) unlike [21], we do not perform any clustering of our vocabulary. Algorithm 1 presents our variant of the iterative parameter mixtures (IPM) algorithm based on sampling.
This presents a merging of concepts from the original IPM algorithm described in [16] and the distributed sample-based algorithm in [18], as well as the lazy L1 SGD computation from [22].

Algorithm 1 Sample-based Iterative Parameter Mixtures
Require: n is the number of samples per worker per epoch
Require: Break S into K partitions
  S ← {D^1, ..., D^j, ..., D^K}
  t ← 0
  Θ_t ← 0
  repeat
    t ← t + 1
    {θ_1^1, ..., θ_L^K} ← IPMMAP(D^1, ..., D^K, Θ_{t−1}, n)
    Θ'_t ← IPMREDUCE(θ_1^1, ..., θ_l^j, ..., θ_L^K)
    Θ_t ← APPLYL1(Θ'_t)
  until converged

  function IPMMAP(D, Θ, n)
    ▷ IPMMAP processes training data in parallel
    Θ_0 ← Θ
    for i = 1 ... n do    ▷ n examples from D
      Sample d_i from D
      Θ'_i ← ApplyLazyL1(ActiveFeatures(d_i, Θ_{i−1}))
      Θ_i ← Θ'_i − α∇F_{d_i}(Θ'_i)
      α ← UpdateAlpha(α, i)
    end for
    return Θ_n
  end function

  function IPMREDUCE(θ_l^1, ..., θ_l^j, ..., θ_l^K)
    ▷ IPMREDUCE processes model parameters in parallel
    θ_l ← (1/K) Σ_j θ_l^j
    return θ_l
  end function

While this is a general paradigm for distributed optimization, we show the MapReduce [23] implementation in Algorithm 1. We begin the process by partitioning the training data S into multiple units D^j, processing each of these units with the IPMMAP function on separate processing nodes. On each of these nodes, IPMMAP samples a subset of D^j which we call d_i. This can be a single example or a mini-batch of examples. We perform the lazy L1 regularization update to the model, compute the gradient of the regularized loss associated with the mini-batch (which can also be done in parallel), update the local copy of the model parameters Θ, and update the learning rate α. Each node samples n examples from its data partition. Finally, IPMREDUCE collects the local model parameters from each IPMMAP and averages them in parallel.
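The map, reduce and regularize cycle of Algorithm 1 can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the authors' implementation: workers run sequentially instead of on separate machines, the learning rate is fixed rather than updated, each step uses a single sampled example, and L1 is applied only after averaging (the lazy per-feature L1 step inside the mapper is omitted). All function names and the seeding scheme are hypothetical.

```python
import random

def soft_threshold(x, lam):
    """L1 shrinkage: move x toward zero by lam, clipping at zero."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def ipm_map(partition, theta, n, grad, alpha=0.1, rng=None):
    """One worker: n SGD steps on examples sampled from its own partition."""
    rng = rng or random.Random(0)
    local = list(theta)  # local copy of the shared model
    for _ in range(n):
        example = rng.choice(partition)
        g = grad(local, example)
        local = [p - alpha * gi for p, gi in zip(local, g)]
    return local

def ipm_reduce(models):
    """Average the K local models parameter by parameter."""
    k = len(models)
    return [sum(column) / k for column in zip(*models)]

def apply_l1(theta, lam):
    """Final L1 step applied to the merged model."""
    return [soft_threshold(p, lam) for p in theta]

def train(partitions, dim, epochs, n, grad, lam=0.0):
    theta = [0.0] * dim
    for epoch in range(1, epochs + 1):
        # Assumed seeding scheme: a distinct seed per epoch and worker.
        local_models = [
            ipm_map(part, theta, n, grad, rng=random.Random(epoch * 1000 + j))
            for j, part in enumerate(partitions)
        ]
        theta = apply_l1(ipm_reduce(local_models), lam)
    return theta
```

For example, with a squared-error gradient `lambda theta, x: [theta[0] - x]` and two partitions `[[1.0], [3.0]]`, each worker drifts toward its own value and the averaged model settles near 2.0 after a few epochs.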
Parallelization here can be done over subsets of the parameter indices (each IPMREDUCE node averages a subset of the parameter space). We refer to each full MapReduce pass as an epoch of training. Starting with the second epoch, the IPMMAP nodes are initialized with the previous epoch’s merged, regularized model.

In a general shared distributed framework, which is used at Google, some machines may be slower than others (due to hardware or overload), machines may fail, or jobs may be preempted. When using a large number of machines this is inevitable. To avoid restarting the training process in these cases, or making all other machines wait for the lagging ones, we enforce a timeout on our trainers. In other words, all mappers have to finish within a certain amount of time. The reducer therefore merges all models once each mapper has either finished processing its samples or timed out.

3.2. Backoff inspired features

MaxEnt language models commonly have n-gram features, which we denote here as a function of the string, the position, and the order as follows:

$$\mathrm{NGram}(w_1 \ldots w_n, i, k) = \langle w_{i-k}, \ldots, w_{i-1}, w_i \rangle$$

We now introduce some features inspired by the backoff parameters $\alpha(w_{i-k}^{i-1})$ presented in Section 2. We begin with the most directly related features, which we term suffix backoff features:

$$\mathrm{SuffixBackoff}(w_1 \ldots w_n, i, k) = \langle w_{i-k}, \ldots, w_{i-1}, \mathrm{BO} \rangle$$

These fire if and only if the full n-gram NGram(w1 . . . wn, i, k) is not in the feature dictionary (see Section 4.1). This is directly analogous to the backoff weights in standard n-gram models, since it is a parameter associated with the prefix history that fires when the particular n-gram is unobserved.

Inspired by the form of this feature, we can introduce other general backoff features. First, rather than just replacing the suffix, we can replace the prefix:

$$\mathrm{PrefixBackoff}(w_1 \ldots w_n, i, k) = \langle \mathrm{BO}, w_{i-k+1}, \ldots, w_i \rangle$$

Next, we can replace multiple words in the feature, to generalize across several such contexts:

$$\mathrm{PrefixBackoff}_j(w_1 \ldots w_n, i, k) = \langle \mathrm{BO}_k, w_{i-j}, \ldots, w_i \rangle$$

$$\mathrm{SuffixBackoff}_j(w_1 \ldots w_n, i, k) = \langle w_{i-k}, \ldots, w_{i-k+j}, \mathrm{BO}_k \rangle$$

These features indicate that an n-gram of length k + 1 ending with (PrefixBackoff_j) or beginning with (SuffixBackoff_j) the particular words retained in the feature is not in the feature dictionary. Note that, if j = k−1, then PrefixBackoff_j is identical to the earlier defined PrefixBackoff feature, and SuffixBackoff_j is identical to SuffixBackoff.

For example, suppose that we have the following string S = “we will save the quail eggs” and that the 4-gram “will save the quail” does not exist in our feature dictionary. Then we can fire the following features at word w_5 = “quail”:

SuffixBackoff(S, 5, 3) = < will, save, the, BO >
PrefixBackoff(S, 5, 3) = < BO, save, the, quail >
SuffixBackoff_0(S, 5, 3) = < will, BO3 >
SuffixBackoff_1(S, 5, 3) = < will, save, BO3 >
PrefixBackoff_0(S, 5, 3) = < BO3, quail >
PrefixBackoff_1(S, 5, 3) = < BO3, the, quail >

As with n-gram feature templates, we include all such features up to some specified length; e.g., a trigram model includes n-grams up to length 3, i.e., unigrams, bigrams and trigrams. Similarly, for our prefix and suffix backoff features, we will have a maximum length and include in our possible feature set all such features of that length or shorter.

4. Experimental results

We performed two experiments to evaluate the utility of these new backoff-inspired features in maximum entropy language models trained on very large corpora. First, we examine perplexity improvements when such features are included in the model alongside n-gram features. Next, we look at Word Error Rate (WER) performance when reranking the output of a baseline recognizer, again using different backoff feature templates. In all cases, we fixed the vocabulary and feature budget of the model so that improvements are not simply due to having more parameters in the model.
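The feature templates of Section 3.2 can be sketched as small Python functions that reproduce the “quail” example. This is an illustrative reading of the templates, assuming 1-based word positions as in the text, tuples as feature keys, and a plain set as the feature dictionary; the function names are hypothetical, and the caller is assumed to pass positions with i > k so that the full history exists.

```python
def ngram(words, i, k):
    """N-gram of length k+1 ending at 1-based position i."""
    return tuple(words[i - k - 1:i])

def suffix_backoff(words, i, k):
    """History of k words followed by the BO token."""
    return tuple(words[i - k - 1:i - 1]) + ("BO",)

def prefix_backoff(words, i, k):
    """BO token followed by the last k words of the n-gram."""
    return ("BO",) + tuple(words[i - k:i])

def suffix_backoff_j(words, i, k, j):
    """First j+1 words of the n-gram followed by BOk."""
    return tuple(words[i - k - 1:i - k + j]) + ("BO%d" % k,)

def prefix_backoff_j(words, i, k, j):
    """BOk followed by the last j+1 words of the n-gram."""
    return ("BO%d" % k,) + tuple(words[i - j - 1:i])

def active_features(words, i, k, dictionary):
    """Fire the n-gram if it is known; otherwise fire the backoff
    features, but only those that are themselves in the dictionary."""
    full = ngram(words, i, k)
    if full in dictionary:
        return [full]
    candidates = [suffix_backoff(words, i, k), prefix_backoff(words, i, k)]
    return [f for f in candidates if f in dictionary]
```

Calling these on `"we will save the quail eggs".split()` with i = 5 and k = 3 yields the six tuples listed in the example of Section 3.2.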
We set the vocabulary of our model to 200 thousand words, by selecting all words from the 2M words in the baseline recognizer vocabulary that had been emitted by the recognizer in the last 6 months of log files. All other words are mapped to “”. We use the same vocabulary in all of our experiments.

For our experiments, we focus on the voice search task. Our data sets are assembled and pooled from anonymized supervised and unsupervised spoken queries (such as search queries, questions, and voice actions) and typed queries to google.com, YouTube, and Google Maps, from desktop and mobile devices. Our overall training set is about 305 billion words (including end of sentence symbols). We divide this set into K subsets. We assign subset D^k to trainer k (where 1 ≤ k ≤ K). Then, we run our distributed training (Algorithm 1) using K machines. Since the amount of training data is very large, trainer k randomly samples data points from its subset D^k. Each epoch utilizes a different seed for sampling, which is equal to the epoch number. As mentioned above, the trainer may terminate due to completing its subsample or due to a timeout. We fix the timeout threshold for each epoch across all our experiments. In our experiments, the timeout is 6 hours.

4.1. Feature Dictionary

A feature dictionary maps each feature key (e.g., trigram: “save the quail”) to an index in the parameter vector Θ. As described in Algorithm 2, we build this dictionary by iterating over all strings in our training data and making use of the NGram function (defined above) to build the n-gram feature keys (for every k = 0 . . . 4). Also, for each string, we build the required backoff feature keys (depending on the experiment). Upon collecting all of these keys, we compute the total observed count for each feature key and then retain only the most frequent ones. We assign a different count cutoff for each feature template.
We determine these counts based on a classical cross-entropy pruned n-gram model trained on the same data. Afterwards, our dictionary maps each key to a unique consecutive index = 0 . . . Dim. In all our experiments, we allocated the same budget of 228 million parameters. It is important to note that the number of features dedicated to backoff features may vary significantly across backoff-feature types.

Note that, while the backoff inspired features detailed in Section 3.2 are defined to fire only when the corresponding n-gram does not appear in the feature dictionary, they themselves must appear in the feature dictionary in order to fire. If one of these features does not appear frequently enough, it will not appear in the feature dictionary and neither the original n-gram nor the backoff feature will fire.

Algorithm 2 Dictionary Construction
  for all w1, w2, ..., wn ∈ Data do
    for i ← 1 ... n do
      ▷ We use 5-gram features.
      for k ← 0 ... 4 do
        key ← NGram(w1, ..., wn, i, k)
        dict_k ← dict_k ∪ {key}
        count_k[key] ← count_k[key] + 1
        ▷ Call the backoff functions above.
        bo_key ← SuffixBackoff(w1, ..., wn, i, k)
        dict_k ← dict_k ∪ {bo_key}
        count_k[bo_key] ← count_k[bo_key] + 1
      end for
    end for
  end for
  ▷ Retain the most frequent features in dict_k and map each feature to a unique index, for each k = 0, ..., 4.

4.2. Feature Sets

In these experiments, all MaxEnt language models include n-grams up to 5-grams. Our backoff inspired features are also based on substrings up to length 5, i.e., up to 4 words, either preceded (prefix) or followed (suffix) by the “BO” token in the case of PrefixBackoff and SuffixBackoff features; or “BOj” up to j = 4 preceding (prefix) or following (suffix) the word. We examine several feature set pools: (1) n-gram features alone (NG); (2) n-gram features plus PrefixBackoff (NG+P) or SuffixBackoff (NG+S); (3) n-gram features plus PrefixBackoff_j (NG+Pk) or SuffixBackoff_j (NG+Sk); and (4) n-gram features plus PrefixBackoff_j and SuffixBackoff (NG+Pk+S) or SuffixBackoff_j (NG+Pk+Sk). In each case, feature dictionaries are built, so they may contain more or fewer n-grams as required to include the backoff features in the dictionary.

For the current experiments, trials with PrefixBackoff_j or SuffixBackoff_j only include features with j = 0, i.e., a single word alongside the “BOk” token. Note that the number of such features is relatively constrained compared to the n-gram features and other backoff features: at most k|V| possible features for a vocabulary V.

Figure 1: Perplexity versus number of epochs of training for various feature sets under the same feature budget constraint. Feature sets include: (1) n-gram features (NG); (2) PrefixBackoff (P); (3) SuffixBackoff (S); (4) PrefixBackoff-k (Pk); and (5) SuffixBackoff-k (Sk).

4.3. Perplexity

Perplexity was measured on a held-aside random sample of 5 million words from our pool of data. Figure 1 plots perplexity versus number of epochs (up to 11) for different possible feature sets. Recall that data is randomly sampled from the overall training set, so that this plot also shows behavior as the amount of training data is increased.
Table 1 presents perplexities after epoch 11, along with the number of samples used during training and the number of active features with non-zero parameters. The number of samples varies because some trainers may run faster than others depending on the number and type of features used; since we enforce a timeout, an epoch may vary in the number of samples processed in time. Nonetheless, Figure 1 shows that most models have approached or reached convergence before completing all 11 epochs. A notable exception is the n-gram only model, which seems to require a few more epochs before reaching convergence, though clearly performance will not reach that of the other trials. This points to another benefit of the backoff features: they also seem to speed convergence for these models. Interestingly, they also seem to considerably reduce the number of active features.

The results show a large perplexity improvement due to the use of backoff features, and in particular the generalized Prefix/SuffixBackoff-k features. One potential reason for the improved performance with these generalized backoff features is that there are relatively few of them and they fire more often, as discussed in the previous section.

Feature Set   Description                                  Pplx    Samp   ActFt
NG            N-grams only                                 167.0   137B   197.8M
NG+P          N-grams + PrefixBackoff                      122.6   112B   189.5M
NG+S          N-grams + SuffixBackoff                      109.8   125B   188.9M
NG+Pk         N-grams + PrefixBackoffk                     88.0    100B   170.1M
NG+Pk+S       N-grams + PrefixBackoffk + SuffixBackoff     85.5    113B   172.6M
NG+Sk         N-grams + SuffixBackoffk                     82.7    126B   160.2M
NG+Pk+Sk      N-grams + PrefixBackoffk + SuffixBackoffk    80.2    96B    162.4M

Table 1: Perplexity (Pplx) after 11 epochs of training, with a fixed feature budget. Also given are the number of samples (Samp) used for training each model, in billions, and the number of active features (ActFt), in millions.

4.4. Speech Recognition Rescoring Results

We evaluated our models by rescoring n-best outputs from a baseline recognizer. In our experiments, we set n to 500.
The acoustic model of the baseline system is a deep neural network-based model with 85M parameters, consisting of eight hidden layers with 2560 Rectified Linear hidden units each and softmax outputs for the 14,000 context-dependent state posteriors. The network processes a context window of 26 (20 past and 5 future) frames of speech, each represented with 40 dimensional log mel filterbank energies taken from 25ms windows every 10ms. The system is trained to a Cross-Entropy criterion on a US English data set of 3M anonymized utterances (1,700 hours or about 600 million frames) collected from live voice search dictation traffic. The utterances are hand-transcribed and force-aligned with a previously trained DNN. See [24] for Google’s VoiceSearch system design.

The baseline LM is a Katz [3] smoothed 5-gram model pruned to 23M n-grams, trained on the same data using Bayesian interpolation to balance multiple sources. It has a vocabulary size of 2M and an OOV rate of 0.57% [25]. The score assigned to each hypothesis by our MaxEnt LM is linearly interpolated with the baseline recognizer’s LM score (with an untuned mixture factor of 0.33).

Table 2 presents WER results for multiple voice-search data sets collected from anonymized and manually transcribed live traffic from mobile devices. These data sets contain regular spoken search queries, questions, and YouTube queries. We achieve modest gains over the baseline system and over rescoring with just n-gram features in all of the test sets, achieving, in aggregate, half a point of improvement over the baseline system.

5. Conclusion

In this paper we introduced and explored features for maximum entropy language models inspired by the backoff mechanism of standardly smoothed language models. We found large perplexity improvements over using n-gram features alone, for the same feature budget; and a 0.5% absolute (3.4% relative) WER improvement over the baseline system for our best performing model.
Future work will include exploring further variants of our general backoff feature templates and combining them with other features beyond n-grams.

Table 2: WER results on 7 sub-corpora and overall, for the baseline recognizer (no reranking) versus reranking models trained with different feature sets.

                                 Reranking feature set
Test Set  Utts / Wds (×1000)  None   NG     NG+Pk  NG+Sk  NG+Pk+Sk
1         22.5 / 98.0         12.7   12.6   12.4   12.4   12.4
2         17.8 / 74.0         12.7   12.5   12.4   12.4   12.3
3         16.2 / 61.1         17.3   17.1   16.7   16.8   16.7
4         18.0 / 64.0         12.8   12.7   12.6   12.6   12.5
5          7.4 / 50.7         16.8   16.6   16.2   16.2   16.2
6          7.3 / 31.9         15.1   15.0   14.8   14.8   14.9
7         19.6 / 69.1         16.5   16.2   15.9   15.9   15.9
all      108.9 / 448.8        14.6   14.4   14.2   14.2   14.1

6. References

[1] R. Lau, R. Rosenfeld, and S. Roukos, “Trigger-based language models: a maximum entropy approach,” in Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1993, pp. 45–48.
[2] R. Rosenfeld, “A maximum entropy approach to adaptive statistical language modeling,” Computer Speech and Language, vol. 10, pp. 187–228, 1996.
[3] S. M. Katz, “Estimation of probabilities from sparse data for the language model component of a speech recogniser,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 35, no. 3, pp. 400–401, 1987.
[4] S. F. Chen and R. Rosenfeld, “A survey of smoothing techniques for ME models,” IEEE Transactions on Speech and Audio Processing, vol. 8, pp. 37–50, 2000.
[5] R. Rosenfeld, “A whole sentence maximum entropy language model,” in Proceedings of IEEE Workshop on Speech Recognition and Understanding, 1997, pp. 230–237.
[6] R. Rosenfeld, S. F. Chen, and X. Zhu, “Whole-sentence exponential language models: a vehicle for linguistic-statistical integration,” Computer Speech and Language, vol. 15, no. 1, pp. 55–73, Jan. 2001.
[7] B. Roark, M. Saraclar, and M. Collins, “Discriminative n-gram language modeling,” Computer Speech & Language, vol. 21, no. 2, pp. 373–392, 2007.
[8] T. Brants, A. C. Popat, P. Xu, F. J. Och, and J. Dean, “Large language models in machine translation,” in Proceedings of the Joint Conference on Empirical Methods in Natural Language Processing (EMNLP) and Computational Natural Language Learning (CoNLL), 2007.
[9] J. Wu and S. Khudanpur, “Efficient training methods for maximum entropy language modeling,” in INTERSPEECH, 2000, pp. 114–118.
[10] T. Alumäe and M. Kurimo, “Efficient estimation of maximum entropy language models with n-gram features: an SRILM extension,” in INTERSPEECH, 2010, pp. 1820–1823.
[11] P. Xu, A. Gunawardana, and S. Khudanpur, “Efficient subsampling for training complex language models,” in Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2011, pp. 1128–1136.
[12] J. Wu and S. Khudanpur, “Building a topic-dependent maximum entropy model for very large corpora,” in Acoustics, Speech, and Signal Processing (ICASSP), 2002 IEEE International Conference on, vol. 1. IEEE, 2002, pp. I–777.
[13] R. Sarikaya, M. Afify, Y. Deng, H. Erdogan, and Y. Gao, “Joint morphological-lexical language modeling for processing morphologically rich languages with application to dialectal Arabic,” Audio, Speech, and Language Processing, IEEE Transactions on, vol. 16, no. 7, pp. 1330–1339, 2008.
[14] M. A. B. Shaik, A. E.-D. Mousa, R. Schlüter, and H. Ney, “Feature-rich sub-lexical language models using a maximum entropy approach for German LVCSR,” in INTERSPEECH, 2013.
[15] J. N. Tsitsiklis, D. P. Bertsekas, and M. Athans, “Distributed asynchronous deterministic and stochastic gradient optimization algorithms,” IEEE Transactions on Automatic Control, vol. 31:9, 1986.
[16] K. Hall, S. Gilpin, and G. Mann, “Mapreduce/bigtable for distributed optimization,” in Neural Information Processing Systems Workshop on Learning on Cores, Clusters, and Clouds, 2010.
[17] R. McDonald, K. Hall, and G. Mann, “Distributed training strategies for the structured perceptron,” in Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, 2010, pp. 456–464.
[18] M. Zinkevich, M. Weimer, A. Smola, and L. Li, “Parallelized stochastic gradient descent,” in Advances in Neural Information Processing Systems 23, J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. Zemel, and A. Culotta, Eds., 2010, pp. 2595–2603.
[19] F. Niu, B. Recht, C. Ré, and S. J. Wright, “Hogwild: A lock-free approach to parallelizing stochastic gradient descent,” in Advances in Neural Information Processing Systems, 2011.
[20] P. Xu, A. Gunawardana, and S. Khudanpur, “Efficient subsampling for training complex language models,” in EMNLP. ACL, 2011, pp. 1128–1136. [Online]. Available: http://dblp.uni-trier.de/db/conf/emnlp/emnlp2011.html#XuGK11
[21] F. Morin and Y. Bengio, “Hierarchical probabilistic neural network language model,” in AISTATS05, 2005, pp. 246–252.
[22] Y. Tsuruoka, J. Tsujii, and S. Ananiadou, “Stochastic gradient descent training for l1-regularized log-linear models with cumulative penalty,” in Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL, 2009, pp. 477–485.
[23] J. Dean and S. Ghemawat, “Mapreduce: Simplified data processing on large clusters,” in Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation - Volume 6, ser. OSDI’04, 2004, pp. 10–10.
[24] J. Schalkwyk, D. Beeferman, F. Beaufays, B. Byrne, C. Chelba, M. Cohen, M. Kamvar, and B. Strope, “Your word is my command: Google search by voice: A case study,” in Advances in Speech Recognition. Springer, 2010, pp. 61–90.
[25] C. Allauzen and M. Riley, “Bayesian language model interpolation for mobile speech input,” in INTERSPEECH, 2011, pp. 1429–1432.

Unsupervised Testing Strategies for ASR

Brian Strope, Doug Beeferman, Alexander Gruenstein, Xin Lei
Google, Inc.
{bps, dougb, alexgru, xinlei}@google.com

Abstract

This paper describes unsupervised strategies for estimating relative accuracy differences between acoustic models or language models used for automatic speech recognition. To test acoustic models, the approach extends ideas used for unsupervised discriminative training to include a more explicit validation on held out data. To test language models, we use a dual interpretation of the same process, this time allowing us to measure differences by exploiting expected ‘truth gradients’ between strong and weak acoustic models. The paper shows correlations between supervised and unsupervised measures across a range of acoustic model and language model variations. We also use unsupervised tests to assess the non-stationary nature of mobile speech input.

Index Terms: speech recognition, unsupervised testing, non-stationary distributions

1. Introduction

Current commercial speech recognition systems can use years of unsupervised data to train relatively large, discriminatively optimized, acoustic models (AM). Similarly, web-scale text corpora for estimating language models (LM) are often available online, and unsupervised recognition results themselves can provide an additional source of LM training data. Since there is no human transcription in any of these steps, the remaining use for manual human transcription is for generating test sets, as a final sanity check for validating system parameters and models. In this paper, we augment that strategy with unsupervised evaluations and begin the discussion of whether eventually we might be able to get rid of the need for any explicit human transcription.

The motivation for human transcription for testing is obvious. Despite steady advances and relative commercial successes, it is generally accepted that humans are much more accurate transcribers than automatic speech recognition systems [1].
While there are a few notable exceptions where machines were more accurate than humans [2], human transcription accuracy is so much better that we use it unquestioningly as our best approximation of absolute truth.

But there are equally obvious disadvantages to relying on human transcription. While it may feel premature, accepting human performance as absolute truth imposes an upper bound on accuracy. The absolute truth is not absolute, and so we’ll eventually have to figure out how to beat it. In fact, we show below that with our current processes and tasks, human transcribers can be only comparable in accuracy to current ASR systems. Absolute truth is already a problem. In response, we are improving transcription processes, but also considering unsupervised ways to augment traditional testing.

Another obvious disadvantage of human transcription is that the tests themselves have to be limited in size and type. Even in a commercially successful research lab, getting extensive tests across every combination of speaker and channel type, recognition context, language, and time period is prohibitive. But a detailed characterization of those types of variations could help prioritize efforts. Similarly, when tests are unsupervised, it is easier to update development and evaluation sets to avoid problems related to stale, over-fit tests.

This is mostly an empirical paper. The next section describes some of the experiments we ran trying to assess our existing human transcription accuracy. Then we describe the generalizations of unsupervised discriminative training that enable a new evaluation strategy. Next the paper includes evaluations that show correlations between supervised and unsupervised tests, and concludes with unsupervised tests that start to characterize the non-stationary distribution of spoken data coming through Google mobile applications.

2. Problems with human transcriptions

Recent efforts have begun to consider human transcription accuracy in the context of increased efficiency. These studies have generally shown that, depending on the amount of effort and the task, individual word error rates can vary from 2% to 15% [3, 4]. Efficiency pressures on human transcription can lead to transcription noise and bias.

2.1. Early experiments

Over the last few years we have seen several simple experiments not work: we have added matched data to our language models and seen error rates get worse; we have added unsupervised acoustic modeling data matched to a new fielded acoustic condition and seen the error rates on new matched tests go up while, surprisingly, error rates on an old test with slightly mismatched conditions went down. For each of these, after tediously examining errors, we found the problem was that we typically “seed” our transcription process with the recognition result from the field. This is mostly a matter of expedience: it is easier for the transcriber to hit return than to type “home depot in palo alto california” yet again, and it can improve reliability since retyping can be error prone. But the power of the suggested transcription is also enough to bias the transcribers into rubber-stamping some of the fielded recognition results. When the transcriber rubber-stamps an error we potentially get penalized twice. The baseline gets credit where it should not, and a new system that corrects that error is falsely penalized for adding an error. The surprising improvement noted on the older, slightly mismatched test happened because the transcriptions for the older test were seeded with transcriptions from an older system, decorrelating some of the transcription bias from the current baseline. In this case, transcription bias toward the baseline model was a bigger effect than the change in acoustics.

Copyright © 2011 ISCA. 28-31 August 2011, Florence, Italy. INTERSPEECH 2011.

2.2. Multiple attempts

To measure the human transcription accuracy more directly, we started sending the same data for multiple attempts at human transcription, and we intentionally reduced the quality of our starting seeds to move any bias away from our best systems. For one test we sent 200K Voice Search utterances to be transcribed twice. Ignoring trivial differences like spaces, apostrophes, function words, and others, half of the transcripts agreed, which implies a sentence transcription accuracy of 71%, assuming independence of the attempts. Similarly, when we sent the remaining 100K utterances, where transcriptions did not agree, back for two more attempts, we were still left with about 10% of the original set with 4 distinct human transcriptions. Again assuming independence, 10% disagreement in 4 attempts is consistent with 68% accuracy for each attempt. But we believe our system has a sentence accuracy higher than 70%.

Looking through the errors, many of the problems are related to cultural references, popular names, and businesses that are not obvious to everyone. The cultural and geographic requirements of the voice search task may be unusually difficult. It combines short utterances and wide open semantic contexts to generate surprisingly unfamiliar sounding speech. Finding ways to bring the correct cultural context to the transcriber is another obvious path to pursue.

3. Generalizing unsupervised discriminative training

While some published results considered unsupervised maximum likelihood estimation of model parameters [5], many systems use unsupervised discriminative optimization, directly using recognizer output as input [6]. Cynically, we might ask what we are learning if we are using the recognition result as truth for discriminatively optimizing its parameters. It is hard to imagine that we can fix the errors it makes, when we use the model to generate truth.
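Returning briefly to the agreement figures of Section 2.2, the two accuracy estimates can be reproduced with a short calculation. The sketch below is one reading of the independence assumption, not a derivation given in the paper: each attempt is correct with probability p independently, and two incorrect transcripts are assumed never to coincide. Under that reading, two attempts agree only when both are correct (probability p squared), and four attempts are all distinct only when at most one of them is correct. The function names are hypothetical.

```python
import math


def accuracy_from_pair_agreement(agree_rate: float) -> float:
    """Two attempts agree only if both are correct: agree_rate = p ** 2."""
    return math.sqrt(agree_rate)


def prob_all_distinct(p: float, attempts: int = 4) -> float:
    """Probability that at most one of `attempts` transcripts is correct."""
    q = 1.0 - p
    return q ** attempts + attempts * p * q ** (attempts - 1)


def accuracy_from_all_distinct(rate: float, attempts: int = 4) -> float:
    """Invert prob_all_distinct by bisection (it decreases as p grows)."""
    lo, hi = 0.0, 1.0
    while hi - lo > 1e-9:
        mid = (lo + hi) / 2.0
        if prob_all_distinct(mid, attempts) > rate:
            lo = mid  # too much disagreement, so accuracy must be higher
        else:
            hi = mid
    return (lo + hi) / 2.0


if __name__ == "__main__":
    print(round(accuracy_from_pair_agreement(0.5), 2))
    print(round(accuracy_from_all_distinct(0.10), 2))
```

Under these assumptions the two estimates come out close to the 71% and 68% quoted in Section 2.2, which suggests that this is the model behind those figures.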
But when we look into the details of commonly used discriminative training techniques based on maximum mutual information, we see that the LM used to generate competing hypotheses is not the same LM used to generate truth. To improve the generalization of discriminative training, we use a unigram to describe the space of potential errors [7], but a trigram or higher to give us transcription truth with unsupervised training. One interpretation of unsupervised discriminative training for acoustic models is that we are using the difference between a weak unigram and a relatively stronger trigram to give us a known improvement in relative truth. We do not know that the strong-LM (trigram) result is absolutely correct, we only know that it is better than the result with the weak LM (unigram). When there is a difference, if we can move toward the results of the strong-LM system by changing acoustic model parameters, then we are building a more accurate AM, that also helps with the final system using a stronger LM. With this interpretation, the AM learns from the ‘truth gradient’ between the strong and weak LMs. 3.1. Unsupervised AM testing Extending unsupervised discriminative AM training to unsupervised AM testing involves retesting the criterion used during training in a new test context. More prescriptively, we sample a new set of live data from production logs, and take the recognition result from the fielded system using a strong AM and LM as assumed truth. Then we re-recognize the same data using multiple strong acoustic models and a weak LM. If one of the systems using a weak LM can better approximate the system using a strong LM, then at a minimum, we can say that it is doing a better job of generalizing our training criteria to new data. More directly, we have evidence that one of the strong acoustic models could be more accurate than the rest. For scoring we are assuming truth from the fielded system, not a human transcriber. 
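The scoring step can be sketched as an ordinary word-level edit distance in which the fielded system's output takes the place of the human reference. This is a minimal illustration, not the scoring tool used for the experiments; the function names are hypothetical, and text normalization is assumed to have happened already.

```python
from typing import List, Sequence


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum number of word substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def word_difference_rate(assumed_truth: List[str], test_outputs: List[str]) -> float:
    """Percentage of word differences relative to the assumed-truth transcripts.

    `assumed_truth` holds the fielded system's results (strong AM, strong LM);
    `test_outputs` holds the results of the system under test.
    """
    total_words = sum(len(t.split()) for t in assumed_truth)
    if total_words == 0:
        raise ValueError("assumed-truth transcripts contain no words")
    edits = sum(word_edit_distance(t.split(), o.split())
                for t, o in zip(assumed_truth, test_outputs))
    return 100.0 * edits / total_words


if __name__ == "__main__":
    fielded = ["home depot in palo alto california"]
    weak_lm = ["home depot palo alto california"]
    print(word_difference_rate(fielded, weak_lm))
```

The computation is identical to a word error rate; only the source of the reference transcripts differs.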
Therefore, when reporting unsupervised testing results, we compute traditional word error rates, but because there is no human transcription, we report the result as a word difference rate (WDR). This highlights that, in the case of unsupervised AM tests for example, the measure counts the word differences between the systems with the strong and weak LMs.

3.2. Unsupervised LM testing

To use the same strategy for LM testing we reverse the roles of the AM and the LM. For better generalization of discriminative AM testing, we used a weak LM to generate more competing alternates. That establishes a truth gradient that generally changes around 1/3 of the words. The dual for LM testing is to use a weak AM instead. To get a truth gradient of a similar magnitude with our systems, we backed off to a context-dependent acoustic model that uses around 1/10th the number of parameters of our strong models, and only uses maximum likelihood training. Then, as above, we test with multiple strong LMs and assume that the LM that can move the results of the system using the weak AM closest to the results of the production system (with the strong AM) is the most accurate LM. With unsupervised LM testing we again report WDR and not WER, where the magnitude of the difference now comes from the difference between the strong AM and the weak AM.

3.3. Relative measures

In this paper we are ignoring the harder problem of measuring absolute accuracy. Instead we focus on relative differences between different acoustic or language models. Others have predicted absolute error measures using statistics from the training set as represented in the final acoustic models [8], without looking at testing data. But here we are interested in estimating relative performance across production data that was unseen during training. Our goal is to assess whether new models or new approaches are helping on new data, and whether the data might be changing from the distributions used during training.

4. Correlating supervised and unsupervised measures

First we show that the performance on unsupervised offline tests for the AM and for the LM correlates with more traditional supervised tests. Our production data started with primarily Voice Search queries intended for google.com, but over time has included increasing amounts of general Voice Input traffic, which includes a large fraction of short person-to-person messages. To start the analyses, we consider these data streams separately.

For Voice Search, our traditional supervised test is built from the 200K utterance set that we sent for multiple transcriptions. For this test we exclude the 10% of the utterances where we got 4 distinct human transcriptions and sample a test set randomly from the remaining 90%. Similarly, for the supervised Voice Input test, we sent utterances twice and selected from the utterances with at least 80% agreement between human transcriptions. On the utterances where not all the words agreed, we randomly chose one of the human transcriptions as truth. This led to a test that excluded about 28% of the utterances. Both of these supervised tests are biased in that they only include the utterances that we could reliably transcribe. The Voice Search test has 27K utterances and 87K words. The Voice Input test has 49K utterances and 320K words.

For the first unsupervised tests here, we sampled production logs for a single day of traffic. We found the median recognizer confidence for each task and then randomly selected a few hundred thousand utterances that were above median confidence for each task. For all unsupervised experiments we used the recognition results from the field as truth.

Our recognition configuration for both systems is fairly standard and described in the literature. Specifically, we use a PLP front-end [11] together with LDA and STC [12], and optimize our acoustic models using BMMI [13] on mostly unsupervised data mixed from both tasks.
Our language models are n-grams, with Katz interpolation and entropy pruning, and the fielded Voice Input system also includes dynamic interpolation [14]. The Voice Search system used trigrams and the Voice Input system included 4-grams.

4.1. AM experiments

The AM experiments use a weak LM (in this case a unigram) for each task, estimated from the few hundred thousand high confidence utterances sampled for that day’s test. All the utterances in the test were also used to train the LM, so there is no OOV. This step is consistent with the matched unigram we train for discriminative acoustic model training. For Voice Search, the resulting unigram had 17K words, and for Voice Input there were 18K unique words.

The acoustic models we tested here were trained using 11M (mostly unsupervised) utterances from a mix of both tasks. The parameter we vary for these experiments is the size of the acoustic models. We use the same decision tree and context state definitions for all models, but we vary the number of Gaussians assigned to each state. Each model is trained with the same number of iterations through all the data. The final model sizes range from 100K to 1M Gaussians. Decoder parameters are set in production mode, which generally means we lose around 0.5% absolute from the best possible accuracy to have faster than real-time search.

# Gauss   Sup VS   Unsup VS   Sup VI   Unsup VI
100K      16.0     36.0       14.5     24.8
200K      15.3     34.4       13.6     22.8
340K      14.6     33.9       13.4     22.7
500K      14.3     33.3       13.2     22.3
1M        13.9     33.0       12.9     21.8

Table 1: WER in % on supervised (Sup) and WDR in % on unsupervised (Unsup) AM tests for Voice Search (VS) and Voice Input (VI).

4.2. LM experiments

For the LM experiments we vary the number of n-grams used for the Voice Input task from around 2M to 30M by varying our final entropy pruning threshold. Unlike the production system used to generate truth for the unsupervised tests, for these tests the LM is a static n-gram. We show results with two different weak acoustic models (A/B).
Condition A is a context-dependent model estimated using maximum likelihood criteria with 2 Gaussians per state for a total of 16K Gaussians. Condition B uses a similar model with a variable number of Gaussians across model states, and a total of 40K Gaussians. On supervised tests, these weak acoustic models have around two to three times the error rates of the final strong production models.

n-grams   Sup PPL   Sup WER   Unsup A/B WDR
1.9M      109       15.2      38.1/25.9
3.8M       98       14.4      36.8/24.5
7.6M       92       14.1      36.0/23.8
15M        87       13.9      35.5/23.2
30M        85       13.7      35.1/22.8

Table 2: Comparing supervised (Sup) and unsupervised (Unsup) LM tests for Voice Input. WER/WDR are in %, PPL is perplexity. Unsup A and B are for different sized AMs.

The relative improvement in both AM and LM experiments is consistently around 10% for a 10x increase in model size. Correlations between supervised and unsupervised tests range between 0.98 and 0.99.

5. Additional experiments

Varying model size is a controlled way to generate accuracy differences. Here we include additional unsupervised measurements that show expected differences in the context of other AM and LM modeling efforts.

5.1. CMLLR

To evaluate an implementation of constrained maximum likelihood linear regression [9] for adaptation, we started by testing with read speech corpora from several data collections [10] used to initialize acoustic models in a new context. With large and regular amounts of acoustic data per speaker, we see the typical improvements of 6-10% relative over a matched discriminative baseline.

To estimate the accuracy impact of CMLLR on the production system (where the actual distribution of the amount of data per user is not imposed by the strict specifications of a data collection), we used unsupervised testing. Here we sampled all personalized users over a 30 day period, and measured the change in WDR with a weak LM and either the production AM or the production AM with CMLLR.
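The correlation figures quoted at the end of Section 4 can be checked directly from the published table values. The sketch below assumes the figure meant is a plain Pearson correlation between each supervised column and the matching unsupervised column; the paper does not say which correlation measure was used, so this is an illustration only. The numbers are copied from Tables 1 and 2.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


# Table 1 (AM size sweep): supervised WER vs. unsupervised WDR.
SUP_VS = [16.0, 15.3, 14.6, 14.3, 13.9]
UNSUP_VS = [36.0, 34.4, 33.9, 33.3, 33.0]
SUP_VI = [14.5, 13.6, 13.4, 13.2, 12.9]
UNSUP_VI = [24.8, 22.8, 22.7, 22.3, 21.8]

# Table 2 (LM size sweep): supervised WER vs. WDR with weak AMs A and B.
SUP_LM = [15.2, 14.4, 14.1, 13.9, 13.7]
UNSUP_A = [38.1, 36.8, 36.0, 35.5, 35.1]
UNSUP_B = [25.9, 24.5, 23.8, 23.2, 22.8]

CORRELATIONS = {
    "AM, Voice Search": pearson(SUP_VS, UNSUP_VS),
    "AM, Voice Input": pearson(SUP_VI, UNSUP_VI),
    "LM, weak AM A": pearson(SUP_LM, UNSUP_A),
    "LM, weak AM B": pearson(SUP_LM, UNSUP_B),
}

if __name__ == "__main__":
    for name, r in CORRELATIONS.items():
        print(f"{name}: r = {r:.3f}")
```

With only five points per sweep these coefficients are coarse, but they should land near the reported range.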
Further, we break the differences in WDR down by the amount of data available for each speaker.

# Utts    No Adapt   Adapt
1-20      25.7       25.4
20-50     26.6       25.6
50-100    25.8       24.6
100-200   23.5       22.5

Table 3: WDR in % on adaptation tests. Input is binned by the number of utterances for a given user.

From the table, it is clear that we are seeing a relative difference similar to the one we saw with more traditional read speech tests, and we are further able to characterize the expected saturation of the relatively small number of parameters in CMLLR after around 20 voice input utterances.

5.2. LM update

At one point we updated our language model to include a rescoring pass more explicitly matched to recent Voice Search queries. By testing this update with recent unsupervised tests we are able to show the expected win on new voice search type utterances.

Model Config   Sup VS   Unsup VS
Original       14.6     30.0
Updated        14.6     28.6

Table 4: WER in % on supervised (Sup) and WDR in % on unsupervised (Unsup) LM tests for Voice Search.

One interpretation of these results is that we are updating the LM to better represent the recent query data, which itself is better matched to the recent unsupervised test. It also suggests that the distribution of our data might be moving.

5.3. Estimating non-stationary distributions

Finally, we ran two sweeps of AM tests to estimate how stationary the acoustics for our system have been over the last 14 months. The first system is trained using the Voice Search supervised data available at the beginning of the 14 months, and the second uses only unsupervised data sampled from the last 3 months. Therefore, one model represents our initial estimate of the distribution, and the other approximates a most recent distribution. Both systems use around 350K Gaussians. To evaluate the AM performance, we use a weak LM estimated from a year’s worth of production data.

Figure 1: Change in WDR over time with two different AMs.
Both lines show that the distribution of the data has shifted away from the original supervised data, and toward the recent unsupervised data. Additional unsupervised tests will illuminate the causes of this change in more detail. We currently suspect an increase in the fraction of voice input recognition, but it is already obvious that the distribution of the acoustics for this data is changing. The plot also suggests that with a single AM the change of WDR across conditions may also be informative.

Note that since we are generalizing from the same criteria we used for AM training, and we are getting rid of some of the necessity of human transcription, we are concerned about converging away from reality. The ground is a little firmer for the LM side, since our current LM processes are in fact not yet learning from AM truth gradients the same way our unsupervised AM training learns from LM truth gradients. From the AM side, our current unsupervised tests are simply checking whether the training optimizations extend to unseen data. Pragmatically, because it is unsupervised we also have the opportunity to test that generalization with a range of weak LMs and with a range of input data, and thereby to increase our confidence in the generalization. Moreover, reducing the accuracy improvement provided by a strong LM seems like a safe requirement to impose on AM training. But from an experiment perspective, we have to remember what gradient we are exploiting and not cheat. In other words, augmenting the AM with features directly related to the strong LM would not lead to improvements. We also monitor coarse signals related to application use (counts of user actions in response to recognition results) to give us additional complementary evidence of successful generalization.

6. Conclusions

This paper extends unsupervised discriminative training to an unsupervised testing strategy suitable for evaluating AM and LM changes.
We show strong correlations with traditional testing strategies when we change AM or LM model size. We also show expected gains on unsupervised measures with other types of AM and LM changes, and use the unsupervised measures to begin to characterize the stationarity of the input data to Google mobile. Together with unsupervised training, unsupervised testing enables development paths that no longer impose human performance as the upper bound for accuracy.

7. References

[1] R. Lippmann, “Speech recognition by machines and humans,” Speech Communication, July 1997.
[2] T. Kristjansson, J. Hershey, P. Olsen, S. Rennie, R. Gopinath, “Super-Human Multi-Talker Speech Recognition: The IBM 2006 Speech Separation Challenge System,” Proc. ICSLP, 2006.
[3] S. Novotney, C. Callison-Burch, “Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription,” Proc. NAACL, 2010.
[4] A. Gruenstein, I. McGraw, A. Sutherland, “A Self-Transcribing Speech Corpus: Collecting Continuous Speech with an Online Educational Game,” Proc. SLaTE, 2009.
[5] J. Ma, R. Schwartz, “Unsupervised versus supervised training of acoustic models,” Proc. ICSLP, 2008.
[6] L. Wang, M. Gales, P. Woodland, “Unsupervised Training for Mandarin Broadcast News Conversation Transcription,” Proc. ICASSP, 2007.
[7] P. C. Woodland, D. Povey, “Large scale discriminative training of hidden Markov models for speech recognition,” Comp. Speech & Lang., Jan. 2002.
[8] Y. Deng, M. Mahajan, A. Acero, “Estimating Speech Recognition Error Rate without Acoustic Test Data,” Proc. Eurospeech, 2003.
[9] M. J. F. Gales, “Maximum likelihood linear transformations for HMM-based speech recognition,” Comp. Speech & Lang., Vol 12.2, 1998.
[10] T. Hughes, K. Nakajima, L. Ha, A. Vasu, P. Moreno, M. LeBeau, “Building transcribed speech corpora quickly and cheaply for many languages,” Proc. ICSLP, 2010.
[11] H. Hermansky, “Perceptual linear predictive (PLP) analysis of speech,” JASA, v87.4, 1990.
[12] M. Gales, “Semi-Tied Covariance Matrices for Hidden Markov Models,” Proc. IEEE Trans. SAP, May 2000.
[13] D. Povey, D. Kanevsky, B. Kingsbury, B. Ramabhadran, G. Saon, K. Visweswariah, “Boosted MMI for model and feature-space discriminative training,” Proc. ICASSP, 2008.
[14] B. Ballinger, C. Allauzen, A. Gruenstein, J. Schalkwyk, “On-Demand Language Model Interpolation for Mobile Speech Input,” Proc. ICSLP, 2010.

Parallel Algorithms for Unsupervised Tagging

Sujith Ravi, Google, Mountain View, CA 94043, sravi@google.com
Sergei Vassilvitskii, Google, Mountain View, CA 94043, sergeiv@google.com
Vibhor Rastogi∗, Twitter, San Francisco, CA, vibhor.rastogi@gmail.com

Abstract

We propose a new method for unsupervised tagging that finds minimal models which are then further improved by Expectation Maximization training. In contrast to previous approaches that rely on manually specified and multi-step heuristics for model minimization, our approach is a simple greedy approximation algorithm DMLC (DISTRIBUTED-MINIMUM-LABEL-COVER) that solves this objective in a single step. We extend the method and show how to efficiently parallelize the algorithm on modern parallel computing platforms while preserving approximation guarantees. The new method easily scales to large data and grammar sizes, overcoming the memory bottleneck in previous approaches. We demonstrate the power of the new algorithm by evaluating on various sequence labeling tasks: Part-of-Speech tagging for multiple languages (including low-resource languages), with complete and incomplete dictionaries, and supertagging, a complex sequence labeling task, where the grammar size alone can grow to millions of entries. Our results show that for all of these settings, our method achieves state-of-the-art scalable performance that yields high quality tagging outputs.

1 Introduction

Supervised sequence labeling with large labeled training datasets is considered a solved problem.
For instance, state of the art systems obtain tagging accuracies over 97% for part-of-speech (POS) tagging on the English Penn Treebank. However, learning accurate taggers without labeled data remains a challenge. The accuracies quickly drop when faced with data from a different domain or language, or when there is very little labeled information available for training (Banko and Moore, 2004).

∗ The research described herein was conducted while the author was working at Google.

Recently, there has been an increasing amount of research tackling this problem using unsupervised methods. A popular approach is to learn from POS-tag dictionaries (Merialdo, 1994), where we are given a raw word sequence and a dictionary of legal tags for each word type. Learning from POS-tag dictionaries is still challenging. Complete word-tag dictionaries may not always be available in every setting. When they are available, the dictionaries are often noisy, resulting in high tagging ambiguity. Furthermore, when applying taggers in new domains or different datasets, we may encounter new words that are missing from the dictionary. There have been some efforts to learn POS taggers from incomplete dictionaries by extending the dictionary to include these words using some heuristics (Toutanova and Johnson, 2008) or using other methods such as type-supervision (Garrette and Baldridge, 2012).

In this work, we tackle the problem of unsupervised sequence labeling using tag dictionaries. The first reported work on this problem was on POS tagging, by Merialdo (1994). The approach involved training a standard Hidden Markov Model (HMM) using the Expectation Maximization (EM) algorithm (Dempster et al., 1977), though EM does not perform well on this task (Johnson, 2007). More recent methods have yielded better performance than EM (see Ravi and Knight (2009) for an overview).
One interesting line of research introduced by Ravi and Knight (2009) explores the idea of performing model minimization followed by EM training to learn taggers. Their idea is closely related to the classic Minimum Description Length principle for model selection (Barron et al., 1998). They (1) formulate an objective function to find the smallest model that explains the text (model minimization step), and then, (2) fit the minimized model to the data (EM step). For POS tagging, this method (Ravi and Knight, 2009) yields the best performance to date; 91.6% tagging accuracy on a standard test dataset from the English Penn Treebank. The original work from (Ravi and Knight, 2009) uses an integer linear programming (ILP) formulation to find minimal models, an approach which does not scale to large datasets. Ravi et al. (2010b) introduced a two-step greedy approximation to the original objective function (called the MIN-GREEDY algorithm) that runs much faster while maintaining the high tagging performance. Garrette and Baldridge (2012) showed how to use several heuristics to further improve this algorithm (for instance, better choice of tag bigrams when breaking ties) and stack other techniques on top, such as careful initialization of HMM emission models which results in further performance gains. Their method also works under incomplete dictionary scenarios and can be applied to certain low-resource scenarios (Garrette and Baldridge, 2013) by combining model minimization with supervised training. In this work, we propose a new scalable algorithm for performing model minimization for this task. By making an assumption on the structure of the solution, we prove that a variant of the greedy set cover algorithm always finds an approximately optimal label set. This is in contrast to previous methods that employ heuristic approaches with no guarantee on the quality of the solution. 
In addition, we do not have to rely on ad hoc tie-breaking procedures or careful initializations for unknown words. Finally, not only is the proposed method approximately optimal, it is also easy to distribute, allowing it to easily scale to very large datasets. We show empirically that our method, combined with an EM training step, outperforms existing state of the art systems.

1.1 Our Contributions

• We present a new method, DISTRIBUTED MINIMUM LABEL COVER, DMLC, for model minimization that uses a fast, greedy algorithm with formal approximation guarantees on the quality of the solution.
• We show how to efficiently parallelize the algorithm while preserving approximation guarantees. In contrast, existing minimization approaches cannot match the new distributed algorithm when scaling from thousands to millions or even billions of tokens.
• We show that our method easily scales to both large data and grammar sizes, and does not require the corpus or label set to fit into memory. This allows us to tackle complex tagging tasks, where the tagset consists of several thousand labels, which results in more than one million entries in the grammar.
• We demonstrate the power of the new method by evaluating under several different scenarios: POS tagging for multiple languages (including low-resource languages), with complete and incomplete dictionaries, as well as a complex sequence labeling task of supertagging. Our results show that for all these settings, our method achieves state-of-the-art performance yielding high quality taggings.

2 Related Work

Recently, there has been an increasing amount of research tackling this problem from multiple directions. Some efforts have focused on inducing POS tag clusters without any tags (Christodoulopoulos et al., 2010; Reichart et al., 2010; Moon et al., 2010), but evaluating such systems proves difficult since it is not straightforward to map the cluster labels onto gold standard tags.
A more popular approach is to learn from POS-tag dictionaries (Merialdo, 1994; Ravi and Knight, 2009), incomplete dictionaries (Hasan and Ng, 2009; Garrette and Baldridge, 2012) and human-constructed dictionaries (Goldberg et al., 2008).

Another direction that has been explored in the past includes bootstrapping taggers for a new language based on information acquired from other languages (Das and Petrov, 2011) or limited annotation resources (Garrette and Baldridge, 2013). Additional work focused on building supervised taggers for noisy domains such as Twitter (Gimpel et al., 2011). While most of the relevant work in this area centers on POS tagging, there has been some work done for building taggers for more complex sequence labeling tasks such as supertagging (Ravi et al., 2010a).

Other related work includes alternative methods for learning sparse models via priors in Bayesian inference (Goldwater and Griffiths, 2007) and posterior regularization (Ganchev et al., 2010). But these methods only encourage sparsity and do not explicitly seek to minimize the model size, which is the objective function used in this work. Moreover, taggers learned using model minimization have been shown to produce state-of-the-art results for the problems discussed here.

3 Model

Following Ravi and Knight (2009), we formulate the problem as that of label selection on the sentence graph. Formally, we are given a set of sequences, $S = \{S_1, S_2, \ldots, S_n\}$, where each $S_i$ is a sequence of words, $S_i = w_{i1}, w_{i2}, \ldots, w_{i,|S_i|}$. With each word $w_{ij}$ we associate a set of possible tags $T_{ij}$. We will denote by $m$ the total number of (possibly duplicate) words (tokens) in the corpus. Additionally, we define two special words $w_0$ and $w_\infty$ with special tags start and end, and consider the modified sequences $S'_i = w_0, S_i, w_\infty$. To simplify notation, we will refer to $w_\infty$ as $w_{|S_i|+1}$.
The sequence labeling problem asks us to select a valid tag $t_{ij} \in T_{ij}$ for each word $w_{ij}$ in the input to minimize a specific objective function. We will refer to a tag pair $(t_{i,j-1}, t_{ij})$ as a label. Our aim is to minimize the number of distinct labels used to cover the full input. Formally, given a sequence $S'_i$ and a tag $t_{ij}$ for each word $w_{ij}$ in $S'_i$, let the induced set of labels for sequence $S'_i$ be

$$L_i = \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}.$$

The total number of distinct labels used over all sequences is then

$$\phi = \Big|\bigcup_i L_i\Big| = \Big|\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}\Big|.$$

Note that the order of the tokens in the label makes a difference, as {(NN, VP)} and {(VP, NN)} are two distinct labels. Now we can define the problem formally, following (Ravi and Knight, 2009).

Problem 1 (Minimum Label Cover). Given a set $S$ of sequences of words, where each word $w_{ij}$ has a set of valid tags $T_{ij}$, the problem is to find a valid tag assignment $t_{ij} \in T_{ij}$ for each word that minimizes the number of distinct labels or tag pairs over all sequences, $\phi = \big|\bigcup_i \bigcup_{j=1}^{|S_i|+1} \{(t_{i,j-1}, t_{ij})\}\big|$.

The problem is closely related to the classical Set Cover problem and is also NP-complete. To reduce Set Cover to the label selection problem, map each element $i$ of the Set Cover instance to a single word sentence $S_i = w_{i1}$, and let the valid tags $T_{i1}$ contain the names of the sets that contain element $i$. Consider a solution to the label selection problem; every sentence $S_i$ is covered by two labels $(w_0, k_i)$ and $(k_i, w_\infty)$, for some $k_i \in T_{i1}$, which corresponds to an element $i$ being covered by set $k_i$ in the Set Cover instance. Thus any valid solution to the label selection problem leads to a feasible solution to the Set Cover problem ($\{k_1, k_2, \ldots\}$) of exactly half the size. Finally, we will use {{. . .}} notation to denote a multiset of elements, i.e. a set where an element may appear multiple times.
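To make the objective concrete, the following is a minimal Python sketch of how $\phi$ could be computed for a fixed tag assignment. The function names and the `<start>`/`<end>` sentinel strings are illustrative assumptions, not part of the original work.

```python
from typing import List, Set, Tuple

# Hypothetical stand-ins for the special start and end tags of w_0 and w_infinity.
START, END = "<start>", "<end>"


def induced_labels(tags: List[str]) -> Set[Tuple[str, str]]:
    """Return the set of labels (ordered tag pairs) induced by one tagged sequence."""
    padded = [START] + tags + [END]
    return set(zip(padded, padded[1:]))


def label_cover_size(tagged_corpus: List[List[str]]) -> int:
    """Compute phi: the number of distinct labels used over all sequences."""
    labels: Set[Tuple[str, str]] = set()
    for tags in tagged_corpus:
        labels |= induced_labels(tags)
    return len(labels)


# Two taggings of a two-sentence corpus: reusing the same tag pairs gives a smaller model.
print(label_cover_size([["DT", "NN"], ["DT", "NN"]]))
print(label_cover_size([["DT", "NN"], ["DT", "VB"]]))
```

Because labels are ordered pairs, `("NN", "VB")` and `("VB", "NN")` are counted separately, matching the note above that order matters.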
4 Algorithm

In this Section, we describe the DISTRIBUTED-MINIMUM-LABEL-COVER, DMLC, algorithm for approximately solving the minimum label cover problem. We describe the algorithm in a centralized setting, and defer the distributed implementation to Section 5. Before describing the algorithm, we briefly explain the relationship of the minimum label cover problem to set cover.

4.1 Modification of Set Cover

As we pointed out earlier, the minimum label cover problem is at least as hard as the Set Cover problem. An additional challenge comes from the fact that labels are tags for a pair of words, and hence are related. For example, if we label a word pair $(w_{i,j-1}, w_{ij})$ as (NN, VP), then the label for the next word pair $(w_{ij}, w_{i,j+1})$ has to be of the form (VP, *), i.e., it has to start with VP.

Previous work (Ravi et al., 2010a; Ravi et al., 2010b) recognized this challenge and employed two phase heuristic approaches. Eschewing heuristics, we will show that with one natural assumption, even with this extra set of constraints, the standard greedy algorithm for this problem results in a solution with a provable approximation ratio of $O(\log m)$. In practice, however, the algorithm performs far better than the worst case ratio, and similar to the work of (Gomes et al., 2006), we find that the greedy approach selects a cover approximately 11% worse than the optimum solution.

Algorithm 1: MLC Algorithm
1: Input: A set of sequences $S$ with each word $w_{ij}$ having possible tags $T_{ij}$.
2: Output: A tag assignment $t_{ij} \in T_{ij}$ for each word $w_{ij}$ approximately minimizing labels.
3: Let $M$ be the multiset of all possible labels generated by choosing each possible tag $t \in T_{ij}$:
   $M = \bigcup_i \bigcup_{j=1}^{|S_i|+1} \bigcup_{t' \in T_{i,j-1},\, t \in T_{ij}} \{\{(t', t)\}\}$   (1)
4: Let $L = \emptyset$ be the set of selected labels.
5: repeat
6:   Select the most frequent label not yet selected: $(t', t) = \arg\max_{(s', s) \notin L} |M \cap (s', s)|$.
7:   For each bigram $(w_{i,j-1}, w_{ij})$ where $t' \in T_{i,j-1}$ and $t \in T_{ij}$, tentatively assign $t'$ to $w_{i,j-1}$ and $t$ to $w_{ij}$. Add $(t', t)$ to $L$.
8:   If a word gets two assignments, select one at random with equal probability.
9:   If a bigram $(w_{i,j-1}, w_{ij})$ is consistent with the assignments in $(t', t)$, fix the tentative assignments, and set $T_{i,j-1} = \{t'\}$ and $T_{ij} = \{t\}$. Recompute $M$, the multiset of possible labels, with the updated $T_{i,j-1}$ and $T_{ij}$.
10: until there are no unassigned words

Algorithm 2: DMLC Implementation
1: Input: A set of sequences $S$ with each word $w_{ij}$ having possible tags $T_{ij}$.
2: Output: A tag assignment $t_{ij} \in T_{ij}$ for each word $w_{ij}$ approximately minimizing labels.
3: (Graph Creation) Initialize each vertex $v_{ij}$ with the set of possible tags $T_{ij}$ and its neighbors $v_{i,j+1}$ and $v_{i,j-1}$.
4: repeat
5:   (Message Passing) Each vertex $v_{ij}$ sends its possible tags $T_{ij}$ to its forward neighbor $v_{i,j+1}$.
6:   (Counter Update) Each vertex receives the tags $T_{i,j-1}$ and adds all possible labels $\{(s, s') \mid s \in T_{i,j-1}, s' \in T_{ij}\}$ to a global counter ($M$).
7:   (MaxLabel Selection) Each vertex queries the global counter $M$ to find the maximum label $(t, t')$.
8:   (Tentative Assignment) Each vertex $v_{ij}$ selects a tag tentatively as follows: If one of the tags $t, t'$ is in the feasible set $T_{ij}$, it tentatively selects the tag.
9:   (Random Assignment) If both are feasible it selects one at random. The vertex communicates its assignment to its neighbors.
10:  (Confirmed Assignment) Each vertex receives the tentative assignment from its neighbors. If together with its neighbors it can match the selected label, the assignment is finalized. If the assigned tag is $t$, then the vertex $v_{ij}$ sets the valid tag set $T_{ij}$ to $\{t\}$.
11: until no unassigned vertices exist.

4.2 MLC Algorithm

We present in Algorithm 1 our MINIMUM LABEL COVER algorithm to approximately solve the minimum label cover problem. The algorithm is simple, efficient, and easy to distribute.
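As a rough illustration of the greedy loop in Algorithm 1, the sketch below counts all still-possible labels, picks the most frequent unselected one, and fixes tags wherever that label fits. It is a simplification under stated assumptions: the random tentative-assignment and conflict-resolution steps are replaced by a deterministic left-to-right pass (so the run is reproducible and always terminates), every word is assumed to appear in the tag dictionary, and the names `mlc_greedy`, `START` and `END` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Label = Tuple[str, str]
START, END = "<start>", "<end>"  # hypothetical sentinel tags for w_0 and w_infinity


def mlc_greedy(
    sentences: List[List[str]], tag_dict: Dict[str, Set[str]]
) -> Tuple[Set[Label], List[List[str]]]:
    """Greedy label selection; returns the selected labels and one tag per word."""
    # Padded tag sets T_ij, with singleton sets for the start and end words.
    tagsets = [[{START}] + [set(tag_dict[w]) for w in s] + [{END}] for s in sentences]
    selected: Set[Label] = set()
    while True:
        # Multiset M: every label still possible for every adjacent word pair.
        counts = Counter(
            (a, b)
            for seq in tagsets
            for left, right in zip(seq, seq[1:])
            for a in left
            for b in right
        )
        candidates = [lab for lab in counts if lab not in selected]
        if not candidates:
            break  # every remaining label is selected, so every word is fixed
        # Most frequent unselected label; ties are broken lexicographically.
        best = max(candidates, key=lambda lab: (counts[lab], lab))
        selected.add(best)
        a, b = best
        # Deterministic stand-in for the tentative/random assignment steps:
        # fix each bigram that can still take the label, scanning left to right.
        for seq in tagsets:
            for j in range(1, len(seq)):
                if a in seq[j - 1] and b in seq[j]:
                    seq[j - 1] = {a}
                    seq[j] = {b}
    tags = [[next(iter(ts)) for ts in seq[1:-1]] for seq in tagsets]
    return selected, tags


toy_dict = {"the": {"DT"}, "dog": {"NN"}, "can": {"NN", "MD", "VB"}}
labels, tagging = mlc_greedy([["the", "dog"], ["the", "can"]], toy_dict)
print(sorted(labels), tagging)
```

On the toy input the ambiguous word "can" is resolved to NN, because the labels (DT, NN) and (NN, end) are already needed by the other sentence and reusing them keeps the label set small.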
The algorithm chooses labels one at a time, selecting a label that covers as many words as possible in every iteration. For this, it generates and maintains a multiset of all possible labels $M$ (Step 3). The multiset contains an occurrence of each valid label; for example, if $w_{i,j-1}$ has two possible valid tags NN and VP, and $w_{ij}$ has one possible valid tag VP, then $M$ will contain two labels, namely (NN, VP) and (VP, VP). Since $M$ is a multiset it will contain duplicates, e.g. the label (NN, VP) will appear for each adjacent pair of words that have NN and VP as valid tags, respectively.

In each iteration, the algorithm picks a label with the largest number of occurrences in $M$ and adds it to the set of chosen labels (Step 6). Intuitively, this is a greedy step to select a label that covers the largest number of word pairs.

Once the algorithm picks a label $(t', t)$, it tries to assign as many words to tags $t$ or $t'$ as possible (Step 7). A word can be assigned $t'$ if $t'$ is a valid tag for it, and $t$ a valid tag for the next word in sequence. Similarly, a word can be assigned $t$ if $t$ is a valid tag for it, and $t'$ a valid tag for the previous word. Some words can get both assignments, in which case we choose one tentatively at random (Step 8). If a word's tentative random tag, say $t$, is consistent with the choices of its adjacent words (say $t'$ from the previous word), then the tentative choice is fixed as a permanent one. Whenever a tag is selected, the set of valid tags $T_{ij}$ for the word is reduced to a singleton $\{t\}$. Once the set of valid tags $T_{ij}$ changes, the multiset $M$ of all possible labels also changes, as seen from Eq. 1. The multiset is then recomputed (Step 9) and the iterations are repeated until all of the words have been tagged.

We can show that under a natural assumption this simple algorithm is approximately optimal.

Assumption 1 (c-feasibility). Let $c \ge 1$ be any number, and $k$ be the size of the optimal solution to the original problem.
In each iteration, the MLC algorithm fixes the tags for some words. We say that the algorithm is c-feasible if after each iteration there exists some solution to the remaining problem, consistent with the chosen tags, with size at most $ck$.

The assumption encodes the fact that a single bad greedy choice is not going to destroy the overall structure of the solution, and a nearly optimal solution remains. We note that this assumption of c-feasibility is not only sufficient, as we will formally show, but is also necessary. Indeed, without any assumptions, once the algorithm fixes the tag for some words, an optimal label may no longer be consistent with the chosen tags, and it is not hard to find contrived examples where the size of the optimal solution doubles after each iteration of MLC.

Since the underlying problem is NP-complete, it is computationally hard to give direct evidence verifying the assumption on natural language inputs. However, on small examples we are able to show that the greedy algorithm is within a small constant factor of the optimum; specifically, it is within 11% of the optimum model size for the POS tagging problem using the standard 24k dataset (Ravi and Knight, 2009). Combined with the fact that the final method outperforms state of the art approaches, this leads us to conclude that the structural assumption is well justified.

Lemma 1. Under the assumption of c-feasibility, the MLC algorithm achieves an $O(c \log m)$ approximation to the minimum label cover problem, where $m = \sum_i |S_i|$ is the total number of tokens.

Proof. To prove the Lemma we will define an objective function $\bar{\phi}$, counting the number of unlabeled word pairs, as a function of possible labels, and show that $\bar{\phi}$ decreases by a factor of $(1 - O(1/ck))$ at every iteration. To define $\bar{\phi}$, we first define $\phi$, the number of labeled word pairs. Consider a particular set of labels, $L = \{L_1, L_2, \ldots, L_k\}$, where each label is a pair $(t_i, t_j)$.
Call $\{t_{ij}\}$ a valid assignment of tokens if for each $w_{ij}$, we have $t_{ij} \in T_{ij}$. Then the score of $L$ under an assignment $t$, which we denote by $\phi_t$, is the number of bigram labels that appear in $L$. Formally, $\phi_t(L) = \big|\bigcup_{i,j} \{\{(t_{i,j-1}, t_{ij}) \cap L\}\}\big|$. Finally, we define $\phi(L)$ to be the score of the best such assignment, $\phi(L) = \max_t \phi_t(L)$, and $\bar{\phi}(L) = m - \phi(L)$ the number of uncovered labels.

Consider the label selected by the algorithm in every step. By the c-feasibility assumption, there exists some solution having $ck$ labels. Thus, some label from that solution covers at least a $1/ck$ fraction of the remaining words. The selected label $(t, t')$ maximizes the intersection with the remaining feasible labels. The conflict resolution step ensures that in expectation the realized benefit is at least half of the maximum, so each iteration shrinks $\bar{\phi}$ by a factor of at least $(1 - 1/(2ck))$. Since $\bar{\phi} \le m$ initially and $(1 - 1/(2ck))^{T} m < 1$ once $T > 2ck \ln m$, after $O(kc \log m)$ iterations all of the labels are covered.

4.3 Fitting the Model Using EM

Once the greedy algorithm terminates and returns a minimized grammar of tag bigrams, we follow the approach of Ravi and Knight (2009) and fit the minimized model to the data using the alternating EM strategy.

In this step, we run an alternating optimization procedure iteratively in phases. In each phase, we initialize (and prune away) parameters within the two HMM components (transition or emission model) using the output from the previous phase. We initialize this procedure by restricting the transition parameters to only those tag bigrams selected in the model minimization step. We train in conjunction with the original emission model using the EM algorithm, which prunes away some of the emission parameters. In the next phase, we alternate the initialization by choosing the pruned emission model along with the original transition model (with the full set of tag bigrams) and retrain using EM.
The alternating EM iterations are terminated when the change in the size of the observed grammar (i.e., the number of unique bigrams in the tagging output) is ≤ 5%.¹ We refer to our entire approach using greedy minimization followed by EM training as DMLC + EM.

¹ For more details on the alternating EM strategy and how initialization with minimized models improves EM performance in alternating iterations, refer to (Ravi and Knight, 2009).

5 Distributed Implementation

The DMLC algorithm is directly suited towards parallelization across many machines. We turn to Pregel (Malewicz et al., 2010), and its open source version Giraph (Apa, 2013). In these systems the computation proceeds in rounds. In every round, every machine does some local processing and then sends arbitrary messages to other machines. Semantically, we think of the communication graph as fixed, and in each round each vertex performs some local computation and then sends messages to its neighbors. This mode of parallel programming directs the programmers to “Think like a vertex.”

Specific systems like Pregel and Giraph build infrastructure that ensures that the overall system is fault tolerant, efficient, and fast. In addition, they provide implementations of commonly used distributed data structures, such as global counters. The programmer's job is simply to specify the code that each vertex will run at every round.

We implemented the DMLC algorithm in Pregel. The implementation is straightforward and given in Algorithm 2. The multiset $M$ of Algorithm 1 is represented as a global counter in Algorithm 2. The message passing (Step 5) and counter update (Step 6) steps update this global counter and hence perform the role of Step 3 of Algorithm 1. Step 7 selects the label with the largest count, which is equivalent to the greedy label picking Step 6 of Algorithm 1.
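The message passing, counter update and maximum-label selection steps of Algorithm 2 can be simulated on a single machine as follows. This is only a sketch in plain Python and does not use the Pregel or Giraph APIs; the `Vertex` class, the function names and the toy sentences are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Vertex:
    tags: Set[str]                    # feasible tag set T_ij of this word
    nxt: Optional[int] = None         # index of the forward neighbour, if any
    inbox: List[Set[str]] = field(default_factory=list)


def message_passing(vertices: List[Vertex]) -> None:
    """Each vertex sends a copy of its tag set to its forward neighbour."""
    for v in vertices:
        if v.nxt is not None:
            vertices[v.nxt].inbox.append(set(v.tags))


def counter_update(vertices: List[Vertex]) -> Counter:
    """Each vertex adds every label (s, s') it can complete to a shared counter."""
    counter: Counter = Counter()
    for v in vertices:
        for prev_tags in v.inbox:
            counter.update((s, t) for s in prev_tags for t in v.tags)
        v.inbox.clear()
    return counter


# Two padded sentences, "the can" and "the dog", stored as two chains of vertices.
vertices = [
    Vertex({"<start>"}, 1), Vertex({"DT"}, 2), Vertex({"NN", "MD"}, 3), Vertex({"<end>"}),
    Vertex({"<start>"}, 5), Vertex({"DT"}, 6), Vertex({"NN"}, 7), Vertex({"<end>"}),
]
message_passing(vertices)
global_counter = counter_update(vertices)
# Maximum-label selection: every vertex would read this same value from the counter.
max_count = max(global_counter.values())
print(global_counter[("DT", "NN")], global_counter[("DT", "MD")], max_count)
```

In a real vertex-centric system the two functions would be the per-vertex code for two consecutive rounds, and the counter would be a distributed aggregate rather than a local `Counter`.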
Finally, steps 8, 9, and 10 update the tag assignment of each vertex, performing the roles of steps 7, 8, and 9, respectively, of Algorithm 1.

5.1 Speeding up the Algorithm

The implementation described above directly copies the sequential algorithm. Here we describe additional steps we took to further improve the parallel running times.

Singleton Sets: As the parallel algorithm proceeds, the set of feasible tags associated with a node slowly shrinks. At some point there is only one tag that a node can take on; however, this tag may be rare, and so it takes a while for it to be selected using the greedy strategy. Nevertheless, if a node and one of its neighbors have only a single tag left, then it is safe to assign the unique label.²

Modifying the Graph: As is often the case, the bottleneck in parallel computations is the communication. To reduce the amount of communication we reduce the graph on the fly, removing nodes and edges once they no longer play a role in the computation. This simple modification decreases the communication time in later rounds as the total size of the problem shrinks.

² We must judiciously initialize the global counter to take care of this assignment, but this is easily accomplished.

6 Experiments and Results

In this Section, we describe the experimental setup for various tasks and settings, and compare the empirical performance of our method against several existing baselines. The performance results for all systems (on all tasks) are measured in terms of tagging accuracy, i.e. the percentage of tokens from the test corpus that were labeled correctly by the system.

6.1 Part-of-Speech Tagging Task

6.1.1 Tagging Using a Complete Dictionary

Data: We use a standard test set (consisting of 24,115 word tokens from the Penn Treebank) for the POS tagging task. The tagset consists of 45 distinct tag labels and the dictionary contains 57,388 word/tag pairs derived from the entire Penn Treebank. Per-token ambiguity for the test data is about 1.5 tags/token.
In addition to the standard 24k dataset, we also train and test on larger data sets— 973k tokens from the Penn Treebank, 3M tokens from PTB+Europarl (Koehn, 2005) data. Methods: We evaluate and compare performance for POS tagging using four different methods that employ the model minimization idea combined with EM training: • EM: Training a bigram HMM model using EM algorithm (Merialdo, 1994). • ILP + EM: Minimizing grammar size using integer linear programming, followed by EM training (Ravi and Knight, 2009). • MIN-GREEDY + EM: Minimizing grammar size using the two-step greedy method (Ravi et al., 2010b). • DMLC + EM: This work. Results: Table 1 shows the results for POS tagging on English Penn Treebank data. On the smaller test datasets, all of the model minimization strategies (methods 2, 3, 4) tend to perform equally well, yielding state-of-the-art results and large improvement over standard EM. When training (and testing) on larger corpora sizes, DMLC yields the best reported performance on this task to date. A major advantage of the new method is that it can easily scale to large corpora sizes and the distributed nature of the algorithm still permits fast, efficient optimization of the global objective function. So, unlike the earlier methods (such as MIN-GREEDY) it is fast enough to run on several millions of tokens to yield additional performance gains (shown in last column). Speedups: We also observe a significant speedup when using the parallelized version of the DMLC algorithm. Performing model minimization on the 24k tokens dataset takes 55 seconds on a single machine, whereas parallelization permits model minimization to be feasible even on large datasets. Fig 1 shows the running time for DMLC when run on a cluster of 100 machines. We vary the input data size from 1M word tokens to about 8M word tokens, while holding the resources constant. Both the algorithm and its distributed implementation in DMLC are linear time operations as evident by the plot. 
In fact, for comparison, we also plot a straight line passing through the first two runtimes. The straight line essentially plots runtimes corresponding to a linear speedup. DMLC clearly achieves better runtimes, showing even better than linear speedup. The reason for this is that the distributed version has a constant overhead for initialization, independent of the data size, while the running time for the rest of the implementation is linear in data size. Thus, as the data size becomes larger, the constant overhead becomes less significant, and the distributed implementation appears to complete slightly faster as data size increases.

Figure 1: Runtime vs. data size (measured in # of word tokens) on 100 machines. For comparison, we also plot a straight line passing through the first two runtimes. The straight line essentially plots runtimes corresponding to a linear speedup. DMLC clearly achieves better runtimes showing a better than linear speedup.

Table 1: Results for unsupervised part-of-speech tagging on English Penn Treebank dataset. Tagging accuracies for different methods are shown on multiple datasets. te shows the size (number of tokens) in the test data, tr represents the size of the raw text used to perform model minimization.

                                           Tagging accuracy (%)
Method                                     te=24k    te=973k   te=973k
                                           tr=24k    tr=973k   tr=3.7M
1. EM                                      81.7      82.3      -
2. ILP + EM (Ravi and Knight, 2009)        91.6      -         -
3. MIN-GREEDY + EM (Ravi et al., 2010b)    91.6      87.1      -
4. DMLC + EM (this work)                   91.4      87.5      87.8

6.1.2 Tagging Using Incomplete Dictionaries

We also evaluate our approach for POS tagging under other resource-constrained scenarios. Obtaining a complete dictionary is often difficult, especially for new domains. To verify the utility of our method when the input dictionary is incomplete, we evaluate against standard datasets used in previous work (Garrette and Baldridge, 2012) and compare against the previous best reported performance for the same task.
In all the experiments (described here and in subsequent sections), we use the following terminology: raw data refers to unlabeled text used by different methods (for model minimization or other unsupervised training procedures such as EM), dictionary consists of word/tag entries that are legal, and test refers to data over which tagging evaluation is performed.

English Data: For English POS tagging with an incomplete dictionary, we evaluate on the Penn Treebank (Marcus et al., 1993) data. Following (Garrette and Baldridge, 2012), we extracted a word-tag dictionary from sections 00-15 (751,059 tokens) consisting of 39,087 word types, 45,331 word/tag entries, a per-type ambiguity of 1.16 yielding a per-token ambiguity of 2.21 on the raw corpus (treating unknown words as having all 45 possible tags). As in their setup, we then use the first 47,996 tokens of section 16 as raw data and perform final evaluation on sections 22-24. We use the raw corpus along with the unlabeled test data to perform model minimization and EM training. Unknown words are allowed to have all possible tags in both these procedures.

Italian Data: The minimization strategy presented here is a general-purpose method that does not require any specific tuning and works for other languages as well. To demonstrate this, we also perform evaluation on a different language (Italian) using the TUT corpus (Bosco et al., 2000). Following (Garrette and Baldridge, 2012), we use the same data splits as their setting. We take the first half of each of the five sections to build the word-tag dictionary, the next quarter as raw data and the last quarter as test data. The dictionary was constructed from 41,000 tokens comprising 7,814 word types, 8,370 word/tag pairs, with a per-type ambiguity of 1.07 and a per-token ambiguity of 1.41 on the raw data. The raw data consisted of 18,574 tokens and the test contained 18,763 tokens.
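The ambiguity figures quoted above can be reproduced with simple arithmetic. The sketch below assumes that per-type ambiguity is the number of word/tag entries divided by the number of word types, and that per-token ambiguity averages the number of candidate tags per token, with unknown words receiving the full tagset; the function names and the toy tokens are hypothetical.

```python
from typing import Dict, List, Set


def per_type_ambiguity(num_entries: int, num_types: int) -> float:
    """Average number of dictionary tags per word type."""
    return num_entries / num_types


def per_token_ambiguity(
    tokens: List[str], tag_dict: Dict[str, Set[str]], tagset_size: int
) -> float:
    """Average number of candidate tags per token; unknown words get all tags."""
    total = sum(len(tag_dict[w]) if w in tag_dict else tagset_size for w in tokens)
    return total / len(tokens)


# Dictionary sizes quoted in the text for the English and Italian setups.
english = per_type_ambiguity(45_331, 39_087)
italian = per_type_ambiguity(8_370, 7_814)
print(round(english, 2), round(italian, 2))

# Toy corpus: "zyx" is not in the dictionary, so it counts as 45 candidate tags.
toy = per_token_ambiguity(["the", "can", "zyx"], {"the": {"DT"}, "can": {"NN", "MD", "VB"}}, 45)
print(toy)
```

The toy example also shows why the per-token ambiguity on raw text can be much higher than the per-type ambiguity of the dictionary: a few unknown words contribute the whole tagset each.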
We use the unlabeled corpus from the raw and test data to perform model minimization followed by unsupervised EM training.

Other Languages: In order to test the effectiveness of our method in other non-English settings, we also report the performance of our method on several other Indo-European languages using treebank data from the CoNLL-X and CoNLL-2007 shared tasks on dependency parsing (Buchholz and Marsi, 2006; Nivre et al., 2007). The corpus statistics for the five languages (Danish, Greek, Italian, Portuguese and Spanish) are listed below. For each language, we construct a dictionary from the raw training data. The unlabeled corpus from the raw training and test data is used to perform model minimization followed by unsupervised EM training. As before, unknown words are allowed to have all possible tags. We report the final tagging performance on the test data and compare it to baseline EM.

Garrette and Baldridge (2012) treat unknown words (words that appear in the raw text but are missing from the dictionary) in a special manner and use several heuristics to perform better initialization for such words (for example, the probability that an unknown word is associated with a particular tag is conditioned on the openness of the tag). They also use an auto-supervision technique to smooth counts learnt from EM onto new words encountered during testing. In contrast, we do not apply any such technique for unknown words and allow them to be mapped uniformly to all possible tags in the dictionary. For this particular set of experiments, the only difference from the Garrette and Baldridge (2012) setup is that we add unlabeled text from the test data (but without any dictionary tag labels or special heuristics) to our existing word tokens from raw text for performing model minimization.
This is a standard practice used in unsupervised training scenarios (for example, Bayesian inference methods) and in general for scalable techniques where the goal is to perform inference on the same data for which one wishes to produce some structured prediction.

Language      Train (tokens)   Dict (entries)   Test (tokens)
DANISH        94386            18797            5852
GREEK         65419            12894            4804
ITALIAN       71199            14934            5096
PORTUGUESE    206678           30053            5867
SPANISH       89334            17176            5694

Results: Table 2 (column 2) compares previously reported results against our approach for English. We observe that our method obtains a huge improvement over standard EM and gets comparable results to the previous best reported scores for the same task from (Garrette and Baldridge, 2012). It is encouraging to note that the new system achieves this performance without using any of the carefully-chosen heuristics employed by the previous method. However, we do note that some of these techniques can be easily combined with our method to produce further improvements.

Table 2 (column 3) also shows results on Italian POS tagging. We observe that our method achieves significant improvements in tagging accuracy over all the baseline systems including the previous best system (+2.9%). This demonstrates that the method generalizes well to other languages and produces consistent tagging improvements over existing methods for the same task.

Results for POS tagging on CoNLL data in five different languages are displayed in Figure 2. Note that the proportion of raw data in test versus train (from the standard CoNLL shared tasks) is much smaller compared to the earlier experimental settings.

[Figure 2: Part-of-Speech tagging accuracy for different languages on CoNLL data using incomplete dictionaries. Bar chart with series EM and DMLC+EM for DANISH, GREEK, ITALIAN, PORTUGUESE and SPANISH; accuracy axis from 50 to 90. Bar labels as extracted: 79.4, 66.3, 84.6, 80.1, 83.1 and 77.8, 65.6, 82, 78.5, 81.3.]
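The remark about the proportion of test to training data can be checked directly from the corpus statistics listed above. The short sketch below computes the test/train token ratio for each CoNLL language and, for contrast, the test/raw ratio of the Italian TUT split described earlier; the variable names are illustrative and the ratio definition (test tokens divided by training or raw tokens) is an assumption about what "proportion" means here.

```python
# (train tokens, test tokens) per language, copied from the corpus statistics table.
conll = {
    "DANISH": (94386, 5852),
    "GREEK": (65419, 4804),
    "ITALIAN": (71199, 5096),
    "PORTUGUESE": (206678, 5867),
    "SPANISH": (89334, 5694),
}

# Test tokens as a fraction of training tokens.
conll_ratios = {lang: test / train for lang, (train, test) in conll.items()}

# Italian TUT split from the earlier experiment: 18,763 test vs. 18,574 raw tokens.
tut_ratio = 18763 / 18574

for lang, ratio in conll_ratios.items():
    print(f"{lang}: {ratio:.3f}")
print(f"TUT (earlier setting): {tut_ratio:.3f}")
```

Under this reading, the test portion is below one tenth of the training data for every CoNLL language, whereas in the TUT setting the test and raw portions are roughly equal in size.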
In general, we observe that adding more raw data for EM training improves the tagging quality (same trend observed earlier in Table 1: column 2 versus column 3). Despite this, DMLC + EM still achieves significant improvements over the baseline EM system on multiple languages (as shown in Figure 2). An additional advantage of the new method is that it can easily scale to larger corpora and it produces a much more compact grammar that can be efficiently incorporated for EM training.

6.1.3 Tagging for Low-Resource Languages

Learning part-of-speech taggers for severely low-resource languages (e.g., Malagasy) is very challenging. In addition to scarce (token-supervised) labeled resources, the tag dictionaries available for training taggers are tiny compared to other languages such as English. Garrette and Baldridge (2013) combine various supervised and semi-supervised learning algorithms into a common POS tagger training pipeline to address some of these challenges. They also report tagging accuracy improvements on low-resource languages when using the combined system over any single algorithm. Their system has four main parts, in order: (1) Tag dictionary expansion using a label propagation algorithm, (2) Weighted model minimization, (3) Expectation maximization (EM) training of HMMs using auto-supervision, (4) MaxEnt Markov Model (MEMM) training. The entire procedure results in a trained tagger model that can then be applied to tag any raw data.³ Step 2 in this procedure involves

³ For more details, refer to (Garrette and Baldridge, 2013).

"We consider a model of repeated online auctions in which an ad with an uncertain click-through rate faces a random distribution of competing bids in each auction and there is discounting of payoffs.
We formulate the optimal solution to this explore/exploit problem as a dynamic programming problem and show that efficiency is maximized by making a bid for each advertiser equal to the advertiser's expected value for the advertising opportunity plus a term proportional to the variance in this value divided by the number of impressions the advertiser has received thus far. We then use this result to illustrate that the value of incorporating active exploration into a machine learning system in an auction environment is exceedingly small."

Accepted for publication in the Annals of Applied Statistics (in press), 09/2014

INFERRING CAUSAL IMPACT USING BAYESIAN STRUCTURAL TIME-SERIES MODELS

By Kay H. Brodersen, Fabian Gallusser, Jim Koehler, Nicolas Remy, and Steven L. Scott
Google, Inc.
E-mail: kbrodersen@google.com

Abstract

An important problem in econometrics and marketing is to infer the causal impact that a designed market intervention has exerted on an outcome metric over time. This paper proposes to infer causal impact on the basis of a diffusion-regression state-space model that predicts the counterfactual market response in a synthetic control that would have occurred had no intervention taken place. In contrast to classical difference-in-differences schemes, state-space models make it possible to (i) infer the temporal evolution of attributable impact, (ii) incorporate empirical priors on the parameters in a fully Bayesian treatment, and (iii) flexibly accommodate multiple sources of variation, including local trends, seasonality, and the time-varying influence of contemporaneous covariates. Using a Markov chain Monte Carlo algorithm for posterior inference, we illustrate the statistical properties of our approach on simulated data. We then demonstrate its practical utility by estimating the causal effect of an online advertising campaign on search-related site visits.
We discuss the strengths and limitations of state-space models in enabling causal attribution in those settings where a randomised experiment is unavailable. The CausalImpact R package provides an implementation of our approach.

Keywords and phrases: causal inference, counterfactual, synthetic control, observational, difference in differences, econometrics, advertising, market research

1. Introduction. This article proposes an approach to inferring the causal impact of a market intervention, such as a new product launch or the onset of an advertising campaign. Our method generalizes the widely used ‘difference-in-differences’ approach to the time-series setting by explicitly modelling the counterfactual of a time series observed both before and after the intervention. It improves on existing methods in two respects: it provides a fully Bayesian time-series estimate for the effect; and it uses model averaging to construct the most appropriate synthetic control for modelling the counterfactual. The CausalImpact R package provides an implementation of our approach (http://google.github.io/CausalImpact/).

Inferring the impact of market interventions is an important and timely problem. Partly because of recent interest in ‘big data,’ many firms have begun to understand that a competitive advantage can be had by systematically using impact measures to inform strategic decision making. An example is the use of ‘A/B experiments’ to identify the most effective market treatments for the purpose of allocating resources (Danaher and Rust, 1996; Seggie, Cavusgil and Phelan, 2007; Leeflang et al., 2009; Stewart, 2009). Here, we focus on measuring the impact of a discrete marketing event, such as the release of a new product, the introduction of a new feature, or the beginning or end of an advertising campaign, with the aim of measuring the event’s impact on a response metric of interest (e.g., sales).
The causal impact of a treatment is the difference between the observed value of the response and the (unobserved) value that would have been obtained under the alternative treatment, i.e., the effect of treatment on the treated (Rubin, 1974; Hitchcock, 2004; Morgan and Winship, 2007; Rubin, 2007; Cox and Wermuth, 2001; Heckman and Vytlacil, 2007; Antonakis et al., 2010; Kleinberg and Hripcsak, 2011; Hoover, 2012; Claveau, 2012). In the present setting the response variable is a time series, so the causal effect of interest is the difference between the observed series and the series that would have been observed had the intervention not taken place. A powerful approach to constructing the counterfactual is based on the idea of combining a set of candidate predictor variables into a single ‘synthetic control’ (Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller, 2010). Broadly speaking, there are three sources of information available for constructing an adequate synthetic control. The first is the time-series behaviour of the response itself, prior to the intervention. The second is the behaviour of other time series that were predictive of the target series prior to the intervention. Such control series can be based, for example, on the same product in a different region that did not receive the intervention, or on a metric that reflects activity in the industry as a whole. In practice, there are often many such series available, and the challenge is to pick the relevant subset to use as contemporaneous controls. This selection is done on the pre-treatment portion of potential controls; but their value for predicting the counterfactual lies in their post-treatment behaviour. As long as the control series received no intervention themselves, it is often reasonable to assume the relationship between the treatment and the control series that existed prior to the intervention to continue afterwards. 
Thus, a plausible estimate of the counterfactual time series can be computed up to the point in time where the relationship between treatment and controls can no longer be assumed to be stationary, e.g., because one of the controls received treatment itself. In a Bayesian framework, a third source of information for inferring the counterfactual is the available prior knowledge about the model parameters, as elicited, for example, by previous studies.

We combine the three preceding sources of information using a state-space time-series model, where one component of state is a linear regression on the contemporaneous predictors. The framework of our model allows us to choose from among a large set of potential controls by placing a spike-and-slab prior on the set of regression coefficients, and by allowing the model to average over the set of controls (George and McCulloch, 1997). We then compute the posterior distribution of the counterfactual time series given the value of the target series in the pre-intervention period, along with the values of the controls in the post-intervention period. Subtracting the predicted from the observed response during the post-intervention period gives a semiparametric Bayesian posterior distribution for the causal effect (Figure 1).

Related work. As with other domains, causal inference in marketing requires subtlety. Marketing data are often observational and rarely follow the ideal of a randomised design. They typically exhibit a low signal-to-noise ratio. They are subject to multiple seasonal variations, and they are often confounded by the effects of unobserved variables and their interactions (for recent examples, see Seggie, Cavusgil and Phelan, 2007; Stewart, 2009; Leeflang et al., 2009; Takada and Bass, 1998; Chan et al., 2010; Lewis and Reiley, 2011; Lewis, Rao and Reiley, 2011; Vaver and Koehler, 2011, 2012).
Rigorous causal inferences can be obtained through randomised experiments, which are often implemented in the form of geo experiments (Vaver and Koehler, 2011, 2012). Many market interventions, however, fail to satisfy the requirements of such approaches. For instance, advertising campaigns are frequently launched across multiple channels, online and offline, which precludes measurement of individual exposure. Campaigns are often targeted at an entire country, and one country only, which prohibits the use of geographic controls within that country. Likewise, a campaign might be launched in several countries but at different points in time. Thus, while a large control group may be available, the treatment group often consists of no more than one region, or a few regions with considerable heterogeneity among them.

A standard approach to causal inference in such settings is based on a linear model of the observed outcomes in the treatment and control group before and after the intervention. One can then estimate the difference between (i) the pre-post difference in the treatment group and (ii) the pre-post difference in the control group. The assumption underlying such difference-in-differences (DD) designs is that the level of the control group provides an adequate proxy for the level that would have been observed in the treatment group in the absence of treatment (see Lester, 1946; Campbell, Stanley and Gage, 1963; Ashenfelter and Card, 1985; Card and Krueger, 1993; Angrist and Krueger, 1999; Athey and Imbens, 2002; Abadie, 2005; Meyer, 1995; Shadish, Cook and Campbell, 2002; Donald and Lang, 2007; Angrist and Pischke, 2008; Robinson, McNulty and Krasno, 2009; Antonakis et al., 2010).

[Figure 1: three panels over a monthly time axis from 2013-01 to 2014-06. Panel (a): Y, model fit, prediction, X1, X2. Panel (b): point-wise impact and ground truth. Panel (c): cumulative impact and ground truth.]

Figure 1. Inferring causal impact through counterfactual predictions. (a) Simulated trajectory of a treated market (Y) with an intervention beginning in January 2014. Two other markets (X1, X2) were not subject to the intervention and allow us to construct a synthetic control (cf. Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller, 2010). Inverting the state-space model described in the main text yields a prediction of what would have happened in Y had the intervention not taken place (posterior predictive expectation of the counterfactual with pointwise 95% posterior probability intervals). (b) The difference between observed data and counterfactual predictions is the inferred causal impact of the intervention. Here, predictions accurately reflect the true (Gamma-shaped) impact. A key characteristic of the inferred impact series is the progressive widening of the posterior intervals (shaded area). This effect emerges naturally from the model structure and agrees with the intuition that predictions should become increasingly uncertain as we look further and further into the (retrospective) future. (c) Another way of visualizing posterior inferences is by means of a cumulative impact plot. It shows, for each day, the summed effect up to that day. Here, the 95% credible interval of the cumulative impact crosses the zero line about five months after the intervention, at which point we would no longer declare a significant overall effect.

DD designs have been limited in three ways. First, DD is traditionally based on a static regression model that assumes i.i.d. data despite the fact that the design has a temporal component. When fit to serially correlated data, static models yield overoptimistic inferences with too narrow uncertainty intervals (see also Solon, 1984; Hansen, 2007a,b; Bertrand, Duflo and Mullainathan, 2002).
Second, most DD analyses only consider two time points: before and after the intervention. In practice, the manner in which an effect evolves over time, especially its onset and decay structure, is often a key question. Third, when DD analyses are based on time series, previous studies have imposed restrictions on the way in which a synthetic control is constructed from a set of predictor variables, which is something we wish to avoid. For example, one strategy (Abadie and Gardeazabal, 2003; Abadie, Diamond and Hainmueller, 2010) has been to choose a convex combination $(w_1, \dots, w_J)$, $w_j \ge 0$, $\sum_j w_j = 1$ of $J$ predictor time series in such a way that a vector of pre-treatment variables (not time series) $X_1$ characterising the treated unit before the intervention is matched most closely by the combination of pre-treatment variables $X_0$ of the control units w.r.t. a vector of importance weights $(v_1, \dots, v_J)$. These weights are themselves determined in such a way that the combination of pre-treatment outcome time series of the control units most closely matches the pre-treatment outcome time series of the treated unit. Such a scheme relies on the availability of interpretable characteristics (e.g., growth predictors), and it precludes non-convex combinations of controls when constructing the weight vector $W$. We prefer to select a combination of control series without reference to external characteristics and purely in terms of how well they explain the pre-treatment outcome time series of the treated unit (while automatically balancing goodness of fit and model complexity through the use of regularizing priors). Another idea (Belloni et al., 2013) has been to use classical variable-selection methods (such as the Lasso) to find a sparse set of predictors. This approach, however, ignores posterior uncertainty about both which predictors to use and their coefficients.

The limitations of DD schemes can be addressed by using state-space
models, coupled with highly flexible regression components, to explain the temporal evolution of an observed outcome. State-space models distinguish between a state equation that describes the transition of a set of latent variables from one time point to the next and an observation equation that specifies how a given system state translates into measurements. This distinction makes them extremely flexible and powerful (see Leeflang et al., 2009, for a discussion in the context of marketing research). The approach described in this paper inherits three main characteristics from the state-space paradigm. First, it allows us to flexibly accommodate different kinds of assumptions about the latent state and emission processes underlying the observed data, including local trends and seasonality. Second, we use a fully Bayesian approach to inferring the temporal evolution of counterfactual activity and incremental impact. One advantage of this is the flexibility with which posterior inferences can be summarised. Third, we use a regression component that precludes a rigid commitment to a particular set of controls by integrating out our posterior uncertainty about the influence of each predictor as well as our uncertainty about which predictors to include in the first place, which avoids overfitting. The remainder of this paper is organised as follows. Section 2 describes the proposed model, its design variations, the choice of diffuse empirical priors on hyperparameters, and a stochastic algorithm for posterior inference based on Markov chain Monte Carlo (MCMC). Section 3 demonstrates important features of the model using simulated data, followed by an application in Section 4 to an advertising campaign run by one of Google’s advertisers. Section 5 puts our approach into context and discusses its scope of application. 2. Bayesian structural time-series models. Structural time-series models are state-space models for time-series data. 
They can be defined in terms of a pair of equations

$$y_t = Z_t^{\top} \alpha_t + \varepsilon_t \tag{2.1}$$

$$\alpha_{t+1} = T_t \alpha_t + R_t \eta_t, \tag{2.2}$$

where $\varepsilon_t \sim \mathcal{N}(0, \sigma_t^2)$ and $\eta_t \sim \mathcal{N}(0, Q_t)$ are independent of all other unknowns. Equation (2.1) is the observation equation; it links the observed data $y_t$ to a latent $d$-dimensional state vector $\alpha_t$. Equation (2.2) is the state equation; it governs the evolution of the state vector $\alpha_t$ through time. In the present paper, $y_t$ is a scalar observation, $Z_t$ is a $d$-dimensional output vector, $T_t$ is a $d \times d$ transition matrix, $R_t$ is a $d \times q$ control matrix, $\varepsilon_t$ is a scalar observation error with noise variance $\sigma_t^2$, and $\eta_t$ is a $q$-dimensional system error with a $q \times q$ state-diffusion matrix $Q_t$, where $q \le d$. Writing the error structure of equation (2.2) as $R_t \eta_t$ allows us to incorporate state components of less than full rank; a model for seasonality will be the most important example.

Structural time-series models are useful in practice because they are flexible and modular. They are flexible in the sense that a very large class of models, including all ARIMA models, can be written in the state-space form given by (2.1) and (2.2). They are modular in the sense that the latent state as well as the associated model matrices $Z_t$, $T_t$, $R_t$, and $Q_t$ can be assembled from a library of component sub-models to capture important features of the data. There are several widely used state-component models for capturing the trend, seasonality, or effects of holidays. A common approach is to assume the errors of different state-component models to be independent (i.e., $Q_t$ is block-diagonal). The vector $\alpha_t$ can then be formed by concatenating the individual state components, while $T_t$ and $R_t$ become block-diagonal matrices.
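The modular assembly just described can be illustrated with a small sketch. It is not taken from the paper or from the CausalImpact package: the helper names `block_diag` and `step`, and the two component blocks, are hypothetical and chosen only to show how concatenated states and block-diagonal transition matrices fit together in the noise-free part of (2.1) and (2.2).

```python
def block_diag(*blocks):
    """Place each block on the diagonal of an otherwise zero matrix."""
    n_rows = sum(len(b) for b in blocks)
    n_cols = sum(len(b[0]) for b in blocks)
    out = [[0.0] * n_cols for _ in range(n_rows)]
    r0 = c0 = 0
    for b in blocks:
        for i, row in enumerate(b):
            for j, value in enumerate(row):
                out[r0 + i][c0 + j] = float(value)
        r0 += len(b)
        c0 += len(b[0])
    return out


def step(T, alpha):
    """Noise-free state transition alpha_{t+1} = T alpha_t."""
    return [sum(T[i][k] * alpha[k] for k in range(len(alpha)))
            for i in range(len(T))]


# Two hypothetical components (assumptions for illustration only):
# a 2-dimensional block and a 1-dimensional block.
T_a = [[1, 1], [0, 1]]
T_b = [[0.5]]

T = block_diag(T_a, T_b)          # block-diagonal transition matrix
alpha = [0.0, 1.0] + [4.0]        # concatenated state vector
Z = [1.0, 0.0] + [1.0]            # concatenated output vector

alpha_next = step(T, alpha)
y_mean = sum(z * a for z, a in zip(Z, alpha_next))  # Z' alpha, without noise
```

Because the blocks do not interact, each component evolves exactly as it would on its own, and the observation mean is the sum of the components' contributions.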
The most important state component for the applications considered in this paper is a regression component that allows us to obtain counterfactual predictions by constructing a synthetic control based on a combination of markets that were not treated. Observed responses from such markets are important because they allow us to explain variance components in the treated market that are not readily captured by more generic seasonal sub-models.

This approach assumes that covariates are unaffected by the effects of treatment. For example, an advertising campaign run in the United States might spill over to Canada or the United Kingdom. When assuming the absence of spill-over effects, the use of such indirectly affected markets as controls would lead to pessimistic inferences; that is, the effect of the campaign would be underestimated (cf. Meyer, 1995).

2.1. Components of state.

Local linear trend. The first component of our model is a local linear trend, defined by the pair of equations

$$\begin{aligned} \mu_{t+1} &= \mu_t + \delta_t + \eta_{\mu,t} \\ \delta_{t+1} &= \delta_t + \eta_{\delta,t} \end{aligned} \tag{2.3}$$

where $\eta_{\mu,t} \sim \mathcal{N}(0, \sigma_\mu^2)$ and $\eta_{\delta,t} \sim \mathcal{N}(0, \sigma_\delta^2)$. The $\mu_t$ component is the value of the trend at time $t$. The $\delta_t$ component is the expected increase in $\mu$ between times $t$ and $t + 1$, so it can be thought of as the slope at time $t$.

The local linear trend model is a popular choice for modelling trends because it quickly adapts to local variation, which is desirable when making short-term predictions. This degree of flexibility may not be desired when making longer-term predictions, as such predictions often come with implausibly wide uncertainty intervals.

There is a generalization of the local linear trend model where the slope exhibits stationarity instead of obeying a random walk. This model can be written as

$$\begin{aligned} \mu_{t+1} &= \mu_t + \delta_t + \eta_{\mu,t} \\ \delta_{t+1} &= D + \rho(\delta_t - D) + \eta_{\delta,t}, \end{aligned} \tag{2.4}$$

where the two components of $\eta$ are independent. In this model, the slope of the time trend exhibits AR(1) variation around a long-term slope of $D$.
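As a rough illustration of the two trend models, the sketch below simulates the level $\mu_t$ under (2.4); setting `rho=1.0` reduces the slope equation to the random walk of (2.3). The function name `simulate_trend` and its arguments are assumptions made for this example and do not come from the paper.

```python
import random


def simulate_trend(n, mu0, delta0, sd_mu, sd_delta, rho=1.0, D=0.0, seed=0):
    """Simulate n values of the level mu_t.

    With rho = 1 the slope follows the random walk of (2.3);
    with |rho| < 1 it follows the AR(1) model of (2.4) around D.
    """
    rng = random.Random(seed)
    mu, delta = mu0, delta0
    path = [mu]
    for _ in range(n - 1):
        mu_next = mu + delta + rng.gauss(0.0, sd_mu)
        delta_next = D + rho * (delta - D) + rng.gauss(0.0, sd_delta)
        mu, delta = mu_next, delta_next
        path.append(mu)
    return path


# Noise-free paths make the difference between the two models visible.
random_walk_slope = simulate_trend(4, 0.0, 2.0, 0.0, 0.0)
mean_reverting_slope = simulate_trend(4, 0.0, 3.0, 0.0, 0.0, rho=0.5, D=1.0)
```

Without noise, the first path grows by a constant slope, while in the second the slope decays from its initial value towards the long-term slope `D`.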
The parameter $\rho$, with $|\rho| < 1$, represents the learning rate at which the local trend is updated. Thus, the model balances short-term information with information from the distant past.

Seasonality. There are several commonly used state-component models to capture seasonality. The most frequently used model in the time domain is

$$\gamma_{t+1} = -\sum_{s=0}^{S-2} \gamma_{t-s} + \eta_{\gamma,t}, \tag{2.5}$$

where $S$ represents the number of seasons, and $\gamma_t$ denotes their joint contribution to the observed response $y_t$. The state in this model consists of the $S - 1$ most recent seasonal effects, but the error term is a scalar, so the evolution equation for this state model is less than full rank. The mean of $\gamma_{t+1}$ is such that the total seasonal effect is zero when summed over $S$ seasons. For example, if we set $S = 4$ to capture four seasons per year, the mean of the winter coefficient will be $-1 \times (\text{spring} + \text{summer} + \text{autumn})$. The part of the transition matrix $T_t$ representing the seasonal model is an $(S-1) \times (S-1)$ matrix with $-1$'s along the top row, $1$'s along the subdiagonal, and $0$'s elsewhere.

The preceding seasonal model can be generalized to allow for multiple seasonal components with different periods. When modelling daily data, for example, we might wish to allow for an $S = 7$ day-of-week effect, as well as an $S = 52$ weekly annual cycle. The latter can be handled by setting $T_t = I_{S-1}$, with zero variance on the error term, when $t$ is not the start of a new week, and setting $T_t$ to the usual seasonal transition matrix, with nonzero error variance, when $t$ is the start of a new week.

Contemporaneous covariates with static coefficients. Control time series that received no treatment are critical to our method for obtaining accurate counterfactual predictions since they account for variance components that are shared by the series, including in particular the effects of other unobserved causes otherwise unaccounted for by the model.
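The seasonal transition matrix described above (with $-1$'s along the top row and $1$'s along the subdiagonal) can be sketched as follows. The function names are hypothetical, and the state is assumed to be ordered from the most recent seasonal effect to the oldest.

```python
def seasonal_transition(S):
    """(S-1) x (S-1) matrix: -1 on the top row, 1 on the subdiagonal."""
    size = S - 1
    T = [[0.0] * size for _ in range(size)]
    T[0] = [-1.0] * size
    for i in range(1, size):
        T[i][i - 1] = 1.0
    return T


def apply(T, state):
    """Noise-free update of the seasonal state (most recent effect first)."""
    return [sum(t * s for t, s in zip(row, state)) for row in T]


T4 = seasonal_transition(4)        # four seasons per year
state = [1.0, 2.0, 3.0]            # three most recent seasonal effects
next_state = apply(T4, state)      # new effect is minus the sum of the rest

# The new effect plus the previous S - 1 effects sum to zero.
total = next_state[0] + sum(state)
```

Applying the noise-free update $S$ times returns the original state, which reflects the periodic structure of the component.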
A natural way of including control series in the model is through a linear regression. Its coefficients can be static or time-varying. A static regression can be written in state-space form by setting $Z_t = \beta^{\top} x_t$ and $\alpha_t = 1$. One advantage of working in a fully Bayesian treatment is that we do not need to commit to a fixed set of covariates. The spike-and-slab prior described in Section 2.2 allows us to integrate out our posterior uncertainty about which covariates to include and how strongly they should influence our predictions, which avoids overfitting. All covariates are assumed to be contemporaneous; the present model does not infer a potential lag between treated and untreated time series. A known lag, however, can be easily incorporated by shifting the corresponding regressor in time.

Contemporaneous covariates with dynamic coefficients. An alternative to the above is a regression component with dynamic regression coefficients to account for time-varying relationships (e.g., Banerjee, Kauffman and Wang, 2007; West and Harrison, 1997). Given covariates $j = 1, \dots, J$, this introduces the dynamic regression component

$$\begin{aligned} x_t^{\top} \beta_t &= \sum_{j=1}^{J} x_{j,t} \beta_{j,t} \\ \beta_{j,t+1} &= \beta_{j,t} + \eta_{\beta,j,t}, \end{aligned} \tag{2.6}$$

where $\eta_{\beta,j,t} \sim \mathcal{N}(0, \sigma_{\beta_j}^2)$. Here, $\beta_{j,t}$ is the coefficient for the $j$th control series and $\sigma_{\beta_j}$ is the standard deviation of its associated random walk. We can write the dynamic regression component in state-space form by setting $Z_t = x_t$ and $\alpha_t = \beta_t$ and by setting the corresponding part of the transition matrix to $T_t = I_{J \times J}$, with $Q_t = \mathrm{diag}(\sigma_{\beta_j}^2)$.

Assembling the state-space model. Structural time-series models allow us to examine the time series at hand and flexibly choose appropriate components for trend, seasonality, and either static or dynamic regression for the controls. The presence or absence of seasonality, for example, will usually be obvious by inspection. A more subtle question is whether to choose static or dynamic regression coefficients.
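To make the distinction between static and dynamic coefficients concrete, the following sketch computes the regression contribution $x_t^{\top}\beta_t$ of (2.6) while the coefficients follow independent random walks. The function name and arguments are assumptions for illustration; with all random-walk standard deviations set to zero, the component reduces to a static regression.

```python
import random


def dynamic_regression(X, beta0, sd_beta, seed=0):
    """Return x_t' beta_t for each row x_t of X.

    Each coefficient follows beta_{j,t+1} = beta_{j,t} + eta_{beta,j,t}
    with eta ~ N(0, sd_beta[j]^2), as in (2.6).
    """
    rng = random.Random(seed)
    beta = list(beta0)
    contributions = []
    for x_t in X:
        contributions.append(sum(x * b for x, b in zip(x_t, beta)))
        beta = [b + rng.gauss(0.0, sd) for b, sd in zip(beta, sd_beta)]
    return contributions


X = [[1.0, 2.0], [3.0, 4.0]]   # two time points, two control series
static = dynamic_regression(X, [0.5, -1.0], [0.0, 0.0])
drifting = dynamic_regression(X, [0.5, -1.0], [0.1, 0.1], seed=1)
```

At the first time point both versions agree, since the coefficients have not yet moved; afterwards the dynamic version drifts away from the static one.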
When the relationship between controls and treated unit has been stable in the past, static coefficients are an attractive option. This is because a spike-and-slab prior can be implemented efficiently within a forward-filtering, backward-sampling framework. This makes it possible to quickly identify a sparse set of covariates even from tens or hundreds of potential variables (Scott and Varian, 2013). Local variability in the treated time series is captured by the dynamic local level or dynamic linear trend component. Covariate stability is typically high when the available covariates are close in nature to the treated metric. The empirical analyses presented in this paper, for example, will be based on a static regression component (Section 4). This choice provides a reasonable compromise between capturing local behaviour and accounting for regression effects.

An alternative would be to use dynamic regression coefficients, as we do, for instance, in our analyses of simulated data (Section 3). Dynamic coefficients are useful when the linear relationship between treated metrics and controls is believed to change over time. There are a number of ways of reducing the computational burden of dealing with a potentially large number of dynamic coefficients. One option is to resort to dynamic latent factors, where one uses $x_t = B u_t + \nu_t$ with $\dim(u_t) \ll J$ and uses $u_t$ instead of $x_t$ as part of $Z_t$ in (2.1), coupled with an AR-type model for $u_t$ itself. Another option is latent thresholding regression, where one uses a dynamic version of the spike-and-slab prior as in Nakajima and West (2013).

The state-component models are assembled independently, with each component providing an additive contribution to $y_t$. Figure 2 illustrates this process assuming a local linear trend paired with a static regression component.

2.2. Prior distributions and prior elicitation. Let θ generically denote the set of all model parameters and let α = (α1, . . .
, αm) denote the full state sequence. We adopt a Bayesian approach to inference by specifying a prior distribution $p(\theta)$ on the model parameters as well as a distribution $p(\alpha_0 \mid \theta)$ on the initial state values. We may then sample from $p(\alpha, \theta \mid y)$ using MCMC.

Most of the models in Section 2.1 depend solely on a small set of variance parameters that govern the diffusion of the individual state components. A typical prior distribution for such a variance is

$$\frac{1}{\sigma^2} \sim \mathcal{G}\left(\frac{\nu}{2}, \frac{s}{2}\right), \tag{2.7}$$

where $\mathcal{G}(a, b)$ is the Gamma distribution with expectation $a/b$. The prior parameters can be interpreted as a prior sum of squares $s$, so that $s/\nu$ is a prior estimate of $\sigma^2$, and $\nu$ is the weight, in units of prior sample size, assigned to the prior estimate.

[Figure 2: graphical model spanning a pre-intervention period and a post-intervention period. Labels: local trend, local level, observed ($y$) and counterfactual ($\tilde y$) activity, controls, control selection, diffusion parameters, regression coefficients, observation noise. Conditional densities shown: $\mathcal{N}(y_t \mid \mu_t + x_t^{\top}\beta_\varrho, \sigma_y^2)$, $\mathcal{N}(\mu_t \mid \mu_{t-1} + \delta_{t-1}, \sigma_\mu^2)$, $\mathcal{N}(\delta_t \mid \delta_{t-1}, \sigma_\delta^2)$, $\mathcal{N}(\beta_\varrho \mid b_\varrho, \sigma_\epsilon^2 (\Sigma_\varrho^{-1})^{-1})$.]

Figure 2. Graphical model for the static-regression variant of the proposed state-space model. Observed market activity $y_{1:n} = (y_1, \dots, y_n)$ is modelled as the result of a latent state plus Gaussian observation noise with error standard deviation $\sigma_y$. The state $\alpha_t$ includes a local level $\mu_t$, a local linear trend $\delta_t$, and a set of contemporaneous covariates $x_t$, scaled by regression coefficients $\beta_\varrho$. State components are assumed to evolve according to independent Gaussian random walks with fixed standard deviations $\sigma_\mu$ and $\sigma_\delta$ (conditional-dependence arrows shown for the first time point only). The model includes empirical priors on these parameters and the initial states. In an alternative formulation, the regression coefficients $\beta$ are themselves subject to random-walk diffusion (see main text).
Of principal interest is the posterior predictive density over the unobserved counterfactual responses $\tilde y_{n+1}, \dots, \tilde y_m$. Subtracting these from the actual observed data $y_{n+1}, \dots, y_m$ yields a probability density over the temporal evolution of causal impact.

We often have a weak default prior belief that the incremental errors in the state process are small, which we can formalize by choosing small values of $\nu$ (e.g., 1) and small values of $s/\nu$. The notion of ‘small’ means different things in different models; for the seasonal and local linear trend models our default priors are $1/\sigma^2 \sim \mathcal{G}(10^{-2}, 10^{-2} s_y^2)$, where $s_y^2 = \sum_t (y_t - \bar y)^2/(n - 1)$ is the sample variance of the target series. Scaling by the sample variance is a minor violation of the Bayesian paradigm, but it is an effective means of choosing a reasonable scale for the prior. It is similar to the popular technique of scaling the data prior to analysis, but we prefer to do the scaling in the prior so we can model the data on its original scale.

When faced with many potential controls, we prefer letting the model choose an appropriate set. This can be achieved by placing a spike-and-slab prior over coefficients (George and McCulloch, 1993, 1997; Polson and Scott, 2011; Scott and Varian, 2013). A spike-and-slab prior combines point mass at zero (the ‘spike’), for an unknown subset of zero coefficients, with a weakly informative distribution on the complementary set of non-zero coefficients (the ‘slab’). Contrary to what its name might suggest, the ‘slab’ is usually not completely flat, but rather a Gaussian with a large variance. Let $\varrho = (\varrho_1, \dots, \varrho_J)$, where $\varrho_j = 1$ if $\beta_j \neq 0$ and $\varrho_j = 0$ otherwise. Let $\beta_\varrho$ denote the non-zero elements of the vector $\beta$ and let $\Sigma_\varrho^{-1}$ denote the rows and columns of $\Sigma^{-1}$ corresponding to non-zero entries in $\varrho$. We can then factorize the spike-and-slab prior as

$$p(\varrho, \beta, 1/\sigma_\varepsilon^2) = p(\varrho)\, p(\sigma_\varepsilon^2 \mid \varrho)\, p(\beta_\varrho \mid \varrho, \sigma_\varepsilon^2). \tag{2.8}$$
The spike portion of (2.8) can be an arbitrary distribution over $\{0, 1\}^J$ in principle; the most common choice in practice is a product of independent Bernoulli distributions,

$$p(\varrho) = \prod_{j=1}^{J} \pi_j^{\varrho_j} (1 - \pi_j)^{1 - \varrho_j}, \tag{2.9}$$

where $\pi_j$ is the prior probability of regressor $j$ being included in the model. Values for $\pi_j$ can be elicited by asking about the expected model size $M$, and then setting all $\pi_j = M/J$. An alternative is to use a more specific set of values $\pi_j$. In particular, one might choose to set certain $\pi_j$ to either 1 or 0 to force the corresponding variables into or out of the model. Generally, framing the prior in terms of expected model size has the advantage that the model can adapt to growing numbers of predictor variables without having to switch to a hierarchical prior (Scott and Berger, 2010).

For the ‘slab’ portion of the prior we use a conjugate normal-inverse Gamma distribution,

$$\beta_\varrho \mid \sigma_\varepsilon^2 \sim \mathcal{N}\left(b_\varrho, \sigma_\varepsilon^2 (\Sigma_\varrho^{-1})^{-1}\right) \tag{2.10}$$

$$\frac{1}{\sigma_\varepsilon^2} \sim \mathcal{G}\left(\frac{\nu}{2}, \frac{s}{2}\right). \tag{2.11}$$

The vector $b$ in equation (2.10) encodes our prior expectation about the value of each element of $\beta$. In practice, we usually set $b = 0$. The prior parameters in equation (2.11) can be elicited by asking about the expected $R^2 \in [0, 1]$ as well as the number of observations worth of weight $\nu$ the prior estimate should be given. Then $s = \nu(1 - R^2) s_y^2$.

The final prior parameter in (2.10) is $\Sigma^{-1}$ which, up to a scaling factor, is the prior precision over $\beta$ in the full model, with all variables included. The total information in the covariates is $X^{\top}X$, and so $\frac{1}{n} X^{\top}X$ is the average information in a single observation. Zellner’s g-prior (Zellner, 1986; Chipman et al., 2001; Liang et al., 2008) sets $\Sigma^{-1} = \frac{g}{n} X^{\top}X$, so that $g$ can be interpreted as $g$ observations worth of information.
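The elicitation rules above amount to a few lines of arithmetic, sketched below under stated assumptions. The function names are hypothetical and the numeric inputs (ten candidate controls, an expected model size of three, an expected $R^2$ of 0.8, a prior weight of 0.01) are arbitrary illustrations rather than values recommended by the paper.

```python
import math


def sample_variance(y):
    """Sample variance s_y^2 with the n - 1 denominator."""
    n = len(y)
    mean = sum(y) / n
    return sum((v - mean) ** 2 for v in y) / (n - 1)


def inclusion_probs(J, expected_size):
    """Set every pi_j = M / J for an expected model size M."""
    return [expected_size / J] * J


def log_prior_rho(rho, pi):
    """Log of the Bernoulli product in (2.9); assumes 0 < pi_j < 1."""
    return sum(math.log(p) if r else math.log(1.0 - p)
               for r, p in zip(rho, pi))


def slab_sum_of_squares(nu, r2, s2_y):
    """Prior sum of squares s = nu * (1 - R^2) * s_y^2 for (2.11)."""
    return nu * (1.0 - r2) * s2_y


y = [1.0, 2.0, 3.0, 4.0, 5.0]                 # toy target series
s2_y = sample_variance(y)
pi = inclusion_probs(J=10, expected_size=3)
s = slab_sum_of_squares(nu=0.01, r2=0.8, s2_y=s2_y)
lp = log_prior_rho([1, 0, 0, 0, 0, 0, 0, 0, 0, 0], pi)
```

Forcing a variable into or out of the model by setting $\pi_j$ to 1 or 0 would need special handling of the logarithm, which this sketch leaves out.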
Zellner’s prior becomes improper when $X^{\top}X$ is not positive definite; we therefore ensure propriety by averaging $X^{\top}X$ with its diagonal,

$$\Sigma^{-1} = \frac{g}{n} \left\{ w X^{\top}X + (1 - w)\, \mathrm{diag}\left(X^{\top}X\right) \right\} \tag{2.12}$$

with default values of $g = 1$ and $w = 1/2$. Overall, this prior specification provides a broadly useful default while providing considerable flexibility in those cases where more specific prior information is available.

2.3. Inference. Posterior inference in our model can be broken down into three pieces. First, we simulate draws of the model parameters $\theta$ and the state vector $\alpha$ given the observed data $y_{1:n}$ in the training period. Second, we use the posterior simulations to simulate from the posterior predictive distribution $p(\tilde y_{n+1:m} \mid y_{1:n})$ over the counterfactual time series $\tilde y_{n+1:m}$ given the observed pre-intervention activity $y_{1:n}$. Third, we use the posterior predictive samples to compute the posterior distribution of the pointwise impact $y_t - \tilde y_t$ for each $t = 1, \dots, m$. We use the same samples to obtain the posterior distribution of cumulative impact.

Posterior simulation. We use a Gibbs sampler to simulate a sequence $(\theta, \alpha)^{(1)}, (\theta, \alpha)^{(2)}, \dots$ from a Markov chain whose stationary distribution is $p(\theta, \alpha \mid y_{1:n})$. The sampler alternates between: a data-augmentation step that simulates from $p(\alpha \mid y_{1:n}, \theta)$; and a parameter-simulation step that simulates from $p(\theta \mid y_{1:n}, \alpha)$.

The data-augmentation step uses the posterior simulation algorithm from Durbin and Koopman (2002), providing an improvement over the earlier forward-filtering, backward-sampling algorithms by Carter and Kohn (1994), Frühwirth-Schnatter (1994), and de Jong and Shephard (1995). In brief, because $p(y_{1:n}, \alpha \mid \theta)$ is jointly multivariate normal, the variance of $p(\alpha \mid y_{1:n}, \theta)$ does not depend on $y_{1:n}$. We can therefore simulate $(y^*_{1:n}, \alpha^*) \sim p(y_{1:n}, \alpha \mid \theta)$ and subtract $E(\alpha^* \mid y^*_{1:n}, \theta)$ to obtain zero-mean noise with the correct variance.
Adding $E(\alpha \mid y_{1:n}, \theta)$ restores the correct mean, which completes the draw. The required expectations can be computed using the Kalman filter and a fast mean smoother described in detail by Durbin and Koopman (2002). The result is a direct simulation from $p(\alpha \mid y_{1:n}, \theta)$ in an algorithm that is linear in the total (pre- and post-intervention) number of time points ($m$), and quadratic in the dimension of the state space ($d$).

Given the draw of the state, the parameter draw is straightforward for all state components other than the static regression coefficients $\beta$. For state components that depend exclusively on variance parameters, we can translate the state draws back to error terms $\eta_t$ and accumulate sums of squares of $\eta$; because of conjugacy with equation (2.7), the posterior distribution remains Gamma distributed.

The draw of the static regression coefficients $\beta$ proceeds as follows. For each $t = 1, \dots, n$ in the pre-intervention period, let $\dot{y}_t$ denote $y_t$ with the contributions from the other state components subtracted away, and let $\dot{y}_{1:n} = (\dot{y}_1, \dots, \dot{y}_n)$. The challenge is to simulate from $p(\varrho, \beta, \sigma_\epsilon^2 \mid \dot{y}_{1:n})$, which we can factor into $p(\varrho \mid \dot{y}_{1:n})\, p(1/\sigma_\epsilon^2 \mid \varrho, \dot{y}_{1:n})\, p(\beta \mid \varrho, \sigma_\epsilon, \dot{y}_{1:n})$. Because of conjugacy, we can integrate out $\beta$ and $1/\sigma_\epsilon^2$ and be left with

$$\varrho \mid \dot{y}_{1:n} \sim C(\dot{y}_{1:n})\, \frac{|\Sigma_\varrho^{-1}|^{1/2}}{|V_\varrho^{-1}|^{1/2}}\, \frac{p(\varrho)}{S_\varrho^{N/2 - 1}}, \tag{2.13}$$

where $C(\dot{y}_{1:n})$ is an unknown normalizing constant. The sufficient statistics in equation (2.13) are

$$V_\varrho^{-1} = \left(X^T X\right)_\varrho + \Sigma_\varrho^{-1}, \qquad \tilde{\beta}_\varrho = \left(V_\varrho^{-1}\right)^{-1} \left(X_\varrho^T \dot{y}_{1:n} + \Sigma_\varrho^{-1} b_\varrho\right),$$

$$N = \nu + n, \qquad S_\varrho = s + \dot{y}_{1:n}^T \dot{y}_{1:n} + b_\varrho^T \Sigma_\varrho^{-1} b_\varrho - \tilde{\beta}_\varrho^T V_\varrho^{-1} \tilde{\beta}_\varrho.$$

To sample from (2.13), we use a Gibbs sampler that draws each $\varrho_j$ given all other $\varrho_{-j}$. Each full-conditional is easy to evaluate because $\varrho_j$ can only assume two possible values. It should be noted that the dimension of all matrices in (2.13) is $\sum_j \varrho_j$, which is small if the model is truly sparse.
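To make the full-conditional step concrete, the following worked expression follows directly from (2.13). The shorthand $f(\varrho)$ for the right-hand side of (2.13) without the constant, and the notation $\varrho^{(j \leftarrow 1)}$ and $\varrho^{(j \leftarrow 0)}$ for $\varrho$ with its $j$-th element set to 1 or 0, are introduced here for illustration only.

```latex
f(\varrho) = \frac{|\Sigma_\varrho^{-1}|^{1/2}}{|V_\varrho^{-1}|^{1/2}}\,
             \frac{p(\varrho)}{S_\varrho^{N/2 - 1}},
\qquad
P\left(\varrho_j = 1 \mid \varrho_{-j}, \dot{y}_{1:n}\right)
  = \frac{f\left(\varrho^{(j \leftarrow 1)}\right)}
         {f\left(\varrho^{(j \leftarrow 1)}\right) + f\left(\varrho^{(j \leftarrow 0)}\right)}.
```

The unknown constant $C(\dot{y}_{1:n})$ appears in both numerator and denominator and cancels, which is why it never needs to be computed.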
There are many matrices to manipulate, but because each is small the overall algorithm is fast. Once the draw of $\varrho$ is complete, we sample directly from $p(\beta, 1/\sigma_\epsilon^2 \mid \varrho, \dot{y}_{1:n})$ using standard conjugate formulae. For an alternative that may be even more computationally efficient, see Ghosh and Clyde (2011).

Posterior predictive simulation. While the posterior over model parameters and states $p(\theta, \alpha \mid y_{1:n})$ can be of interest in its own right, causal impact analyses are primarily concerned with the posterior incremental effect,

$$p\left(\tilde{y}_{n+1:m} \mid y_{1:n}, x_{1:m}\right). \tag{2.14}$$

As shown by its indices, the density in equation (2.14) is defined precisely for that portion of the time series which is unobserved: the counterfactual market response $\tilde{y}_{n+1}, \dots, \tilde{y}_m$ that would have been observed in the treated market, after the intervention, in the absence of treatment.

It is also worth emphasizing that the density is conditional on the observed data (as well as the priors) and only on these, i.e., on activity in the treatment market before the beginning of the intervention as well as activity in all control markets both before and during the intervention. The density is not conditioned on parameter estimates or the inclusion or exclusion of covariates with static regression coefficients, all of which have been integrated out. Thus, through Bayesian model averaging, we neither commit to any particular set of covariates, which helps avoid an arbitrary selection; nor to point estimates of their coefficients, which prevents overfitting.

The posterior predictive density in (2.14) is defined as a coherent (joint) distribution over all counterfactual data points, rather than as a collection of pointwise univariate distributions. This ensures that we correctly propagate the serial structure determined on pre-intervention data to the trajectory of counterfactuals.
This is crucial, in particular, when forming summary statistics, such as the cumulative effect of the intervention on the treatment market.

Posterior inference was implemented in C++ with an R interface. Given a typically-sized dataset with $m = 500$ time points, $J = 10$ covariates, and 10,000 iterations (see Section 4 for an example), this implementation takes less than 30 seconds to complete on a standard computer, enabling near-interactive analyses.

2.4. Evaluating impact. Samples from the posterior predictive distribution over counterfactual activity can be readily used to obtain samples from the posterior causal effect, i.e., the quantity we are typically interested in. For each draw $\tau$ and for each time point $t = n + 1, \dots, m$, we set

$$\phi_t^{(\tau)} := y_t - \tilde{y}_t^{(\tau)}, \tag{2.15}$$

yielding samples from the approximate posterior predictive density of the effect attributed to the intervention.

In addition to its pointwise impact, we often wish to understand the cumulative effect of an intervention over time. One of the main advantages of a sampling approach to posterior inference is the flexibility and ease with which such derived inferences can be obtained. Reusing the impact samples obtained in (2.15), we compute for each draw $\tau$

$$\sum_{t'=n+1}^{t} \phi_{t'}^{(\tau)} \qquad \forall t = n + 1, \dots, m. \tag{2.16}$$

The preceding cumulative sum of causal increments is a useful quantity when $y$ represents a flow quantity, measured over an interval of time (e.g., a day), such as the number of searches, sign-ups, sales, additional installs, or new users. It becomes uninterpretable when $y$ represents a stock quantity, usefully defined only for a point in time, such as the total number of clients, users, or subscribers. In this case we might instead choose, for each $\tau$, to draw a sample of the posterior running average effect following the intervention,

$$\frac{1}{t - n} \sum_{t'=n+1}^{t} \phi_{t'}^{(\tau)} \qquad \forall t = n + 1, \dots, m. \tag{2.17}$$
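The three derived quantities in (2.15) to (2.17) can be sketched as simple operations on posterior draws. The code below is a minimal illustration, not the authors' implementation: the function names are hypothetical, the numbers are made up, and the interval is computed from linearly interpolated empirical quantiles, which is one of several reasonable conventions.

```python
import math


def pointwise_effects(observed, counterfactual_draw):
    """Equation (2.15): phi_t = y_t - y~_t for one posterior draw."""
    return [y - y_tilde for y, y_tilde in zip(observed, counterfactual_draw)]


def cumulative_effects(effects):
    """Equation (2.16): running sum of the pointwise effects."""
    total, out = 0.0, []
    for phi in effects:
        total += phi
        out.append(total)
    return out


def running_average_effects(effects):
    """Equation (2.17): cumulative effect divided by t - n (1, 2, 3, ...)."""
    return [c / (k + 1) for k, c in enumerate(cumulative_effects(effects))]


def central_interval(values, level=0.95):
    """Central interval from linearly interpolated empirical quantiles."""
    xs = sorted(values)

    def quantile(p):
        pos = p * (len(xs) - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1.0 - level) / 2.0
    return quantile(tail), quantile(1.0 - tail)


# Made-up post-intervention data: 3 time points, 3 posterior draws.
observed_post = [12.0, 13.0, 15.0]
counterfactual_draws = [
    [10.0, 11.0, 12.0],
    [11.0, 11.0, 13.0],
    [9.0, 12.0, 14.0],
]

effects = [pointwise_effects(observed_post, d) for d in counterfactual_draws]
cumulative = [cumulative_effects(e) for e in effects]
final_totals = [c[-1] for c in cumulative]   # one cumulative total per draw
interval = central_interval(final_totals)    # interval for the total effect
```

Because each draw is a whole trajectory, the cumulative sum is formed within a draw before any summary across draws is taken; this is what preserves the serial dependence of the joint predictive distribution.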
Unlike the cumulative effect in (2.16), the running average is always interpretable, regardless of whether it refers to a flow or a stock. However, it depends more strongly on the length of the post-intervention period under consideration. In particular, under the assumption of a true impact that grows quickly at first and then declines to zero, the cumulative impact approaches its true total value (in expectation) as we increase the counterfactual forecasting period, whereas the average impact will eventually approach zero (while, in contrast, the probability intervals diverge in both cases, leading to more and more uncertain inferences as the forecasting period increases).

3. Application to simulated data. To study the characteristics of our approach, we analysed simulated (i.e., computer-generated) data across a series of independent simulations. Generated time series started on 1 January 2013 and ended on 30 June 2014, with a perturbation beginning on 1 January 2014. The data were simulated using a dynamic regression component with two covariates whose coefficients evolved according to independent random walks, $\beta_t \sim \mathcal{N}(\beta_{t-1}, 0.01^2)$, initialized at $\beta_0 = 1$. The covariates themselves were simple sinusoids with wavelengths of 90 days and 360 days, respectively.

[Figure 3. Panels: (a) simulated time series from 2013-01 to 2014-06 (observed, predicted, true); (b) proportion of intervals excluding zero against effect size (%); (c) proportion of intervals containing truth against campaign duration (days).]

Figure 3. Adequacy of posterior uncertainty. (a) Example of one of the 256 datasets created to assess estimation accuracy. Simulated observations (black) are based on two contemporaneous covariates, scaled by time-varying coefficients plus a time-varying local level (not shown). During the campaign period, where the data are lifted by an effect size of 10%, the plot shows the posterior expectation of counterfactual activity (blue), along with its pointwise central 95% credible intervals (blue shaded area), and, for comparison, the true counterfactual (green). (b) Power curve. Following repeated application of the model to simulated data, the plot shows the empirical frequency of concluding that a causal effect was present, as a function of true effect size, given a post-intervention period of 6 months. The curve represents sensitivity in those parts of the graph where the true effect size is positive, and 1 − specificity where the true effect size is zero. Error bars represent 95% credible intervals for the true sensitivity, using a uniform Beta(1, 1) prior. (c) Interval coverage. Using an effect size of 10%, the plot shows the proportion of simulations in which the pointwise central 95% credible interval contained the true impact, as a function of campaign duration. Intervals should contain ground truth in 95% of simulations, however much uncertainty the predictions may be associated with. Error bars represent 95% credible intervals.

The latent state underlying the observed data was generated using a local level that evolved according to a random walk, $\mu_t \sim \mathcal{N}(\mu_{t-1}, 0.1^2)$, initialized at $\mu_0 = 0$. Independent observation noise was sampled using $\epsilon_t \sim \mathcal{N}(0, 0.1^2)$. In summary, observations $y_t$ were generated using

$$y_t = \beta_{t,1} z_{t,1} + \beta_{t,2} z_{t,2} + \mu_t + \epsilon_t.$$

To simulate the effect of advertising, the post-intervention portion of the preceding series was multiplied by $1 + e$, where $e$ (not to be confused with $\epsilon$) represented the true effect size specifying the (uniform) relative lift during the campaign period. An example is shown in Figure 3a.

Sensitivity and specificity. To study the properties of our model, we began by considering under what circumstances we successfully detected a causal
effect, i.e., the statistical power or sensitivity of our approach. A related property is the probability of not detecting an absent impact, i.e., specificity. We repeatedly generated data, as described above, under different true effect sizes. We then computed the posterior predictive distribution over the counterfactuals, and recorded whether or not we would have concluded a causal effect.

For each of the effect sizes 0%, 0.1%, 1%, 10%, and 100%, a total of $2^8 = 256$ simulations were run. This number was chosen simply on the grounds that it provided reasonably tight intervals around the reported summary statistics without requiring excessive amounts of computation. In each simulation, we concluded that a causal effect was present if and only if the central 95% posterior probability interval of the cumulative effect excluded zero.

The model used throughout this section comprised two structural blocks. The first one was a local level component. We placed an inverse-Gamma prior on its diffusion variance with a prior estimate of $s/\nu = 0.1\sigma_y$ and a prior sample size $\nu = 32$. The second structural block was a dynamic regression component. We placed a Gamma prior with prior expectation $0.1\sigma_y$ on the diffusion variance of both regression coefficients. By construction, the outcome variable did not exhibit any local trends or seasonality other than the variation conveyed through the covariates. This obviated the need to include an explicit local linear trend or seasonality component in the model.

In a first analysis, we considered the empirical proportion of simulations in which a causal effect had been detected. When taking into account only those simulations where the true effect size was greater than zero, these empirical proportions provide estimates of the sensitivity of the model w.r.t. the process by which the data were generated. Conversely, those simulations where the campaign had had no effect yield an estimate of the model’s specificity.
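The data-generating process and the detection rule described above can be sketched as follows. This is an illustration under assumptions rather than the code used for the reported simulations: the function names are hypothetical, the sinusoids are taken to have unit amplitude and no offset (the text does not specify their scale), and the model-fitting step that would produce the posterior draws of the cumulative effect is omitted entirely.

```python
import math
import random
from datetime import date


def simulate_series(effect_size, seed=0):
    """Simulate y_t = beta_{t,1} z_{t,1} + beta_{t,2} z_{t,2} + mu_t + eps_t.

    Returns the daily series and the index of the first treated day.
    """
    start, intervention, end = date(2013, 1, 1), date(2014, 1, 1), date(2014, 6, 30)
    n_days = (end - start).days + 1
    first_treated = (intervention - start).days

    rng = random.Random(seed)
    beta = [1.0, 1.0]   # regression coefficients, initialized at 1
    mu = 0.0            # local level, initialized at 0
    series = []
    for t in range(n_days):
        beta = [rng.gauss(b, 0.01) for b in beta]   # random walks, sd 0.01
        mu = rng.gauss(mu, 0.1)                     # random walk, sd 0.1
        # Assumed covariate scale: unit-amplitude sinusoids, no offset.
        z = [math.sin(2.0 * math.pi * t / w) for w in (90.0, 360.0)]
        y = beta[0] * z[0] + beta[1] * z[1] + mu + rng.gauss(0.0, 0.1)
        if t >= first_treated:
            y *= 1.0 + effect_size                  # uniform relative lift e
        series.append(y)
    return series, first_treated


def detects_effect(cumulative_effect_draws, level=0.95):
    """True iff the central interval of the cumulative effect excludes zero."""
    xs = sorted(cumulative_effect_draws)

    def quantile(p):
        pos = p * (len(xs) - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1.0 - level) / 2.0
    lower, upper = quantile(tail), quantile(1.0 - tail)
    return lower > 0.0 or upper < 0.0


def detection_rate(flags):
    """Empirical proportion of simulations in which an effect was detected."""
    return sum(1 for f in flags if f) / len(flags)


baseline, first_treated = simulate_series(0.0, seed=1)
lifted, _ = simulate_series(0.1, seed=1)
```

Applied to simulations with a positive true effect, `detection_rate` estimates sensitivity; applied to simulations with zero effect, it estimates 1 − specificity.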
In this way, we obtained the power curve shown in Figure 3b. The curve shows that, in data such as these, a market perturbation leading to a lift no larger than 1% is missed in about 90% of cases. By contrast, a perturbation that lifts market activity by 25% is correctly detected as such in most cases.

In a second analysis, we assessed the coverage properties of the posterior probability intervals obtained through our model. It is desirable to use a diffuse prior on the local level component such that central 95% intervals contain ground truth in about 95% of the simulations. This coverage frequency should hold regardless of the length of the campaign period. In other words, a longer campaign should lead to posterior intervals that are appropriately widened to retain the same coverage probability as the narrower intervals obtained for shorter campaigns. This was approximately the case throughout the simulated campaign (Figure 3c).

Estimation accuracy. To study the accuracy of the point estimates supported by our approach, we repeated the preceding simulations with a fixed effect size of 10% while varying the length of the campaign. When given a quadratic loss function, the loss-minimizing point estimate is the posterior expectation of the predictive density over counterfactuals. Thus, for each generated dataset $i$, we computed the expected causal effect for each time point,

$$\hat{\phi}_{i,t} := \langle \phi_t \mid y_1, \dots, y_m, x_1, \dots, x_m \rangle \qquad \forall t = n + 1, \dots, m;\; i = 1, \dots, 256. \tag{3.1}$$

To quantify the discrepancy between estimated and true impact, we calculated the absolute percentage estimation error,

$$a_{i,t} := \frac{\left| \hat{\phi}_{i,t} - \phi_t \right|}{\phi_t}. \tag{3.2}$$

This yielded an empirical distribution of absolute percentage estimation errors (Figure 4a; blue), showing that impact estimates become less and less accurate as the forecasting period increases.
This is because, under the local linear trend model in (2.3), the true counterfactual activity becomes more and more likely to deviate from its expected trajectory.

It is worth emphasizing that all preceding results are based on the assumption that the model structure remains intact throughout the modelling period. In other words, even though the model is built around the idea of multiple (non-stationary) components (i.e., a time-varying local trend and, potentially, time-varying regression coefficients), this structure itself remains unchanged. If the model structure does change, estimation accuracy may suffer.

We studied the impact of a changing model structure in a second simulation in which we repeated the procedure above in such a way that 90 days after the beginning of the campaign the standard deviation of the random walk governing the evolution of the regression coefficient was tripled (now 0.03 instead of 0.01). As a result, the observed data began to diverge much more quickly than before. Accordingly, estimations became considerably less reliable (Figure 4a, red). An example of the underlying data is shown in Figure 4b.

The preceding simulations highlight the importance of a model that is sufficiently flexible to account for phenomena typically encountered in seasonal empirical data. This rules out entirely static models in particular (such as multiple linear regression).

[Figure 4. Panels: (a) absolute % error from 2014-01 to 2014-06; (b) example time series from 2014-01 to 2014-06.]

Figure 4. Estimation accuracy. (a) Time series of absolute percentage discrepancy between inferred effect and true effect. The plot shows the rate (mean ± 2 s.e.m.) at which predictions become less accurate as the length of the counterfactual forecasting period increases (blue). The well-behaved decrease in estimation accuracy breaks down when the data are subject to a sudden structural change (red), as simulated for 1 April 2014. (b) Illustration of a structural break. The plot shows one example of the time series underlying the red curve in (a). On 1 April 2014, the standard deviation of the generating random walk of the local level was tripled, causing the rapid decline in estimation accuracy seen in the red curve in (a).

4. Application to empirical data. To illustrate the practical utility of our approach, we analysed an advertising campaign run by one of Google’s advertisers in the United States. In particular, we inferred the campaign’s causal effect on the number of times a user was directed to the advertiser’s website from the Google search results page. We provide a brief overview of the underlying data below (see Vaver and Koehler, 2011, for additional details).

The campaign analysed here was based on product-related ads to be displayed alongside Google’s search results for specific keywords. Ads went live for a period of 6 consecutive weeks and were geo-targeted to a randomised set of 95 out of 190 designated market areas (DMAs). The most salient observable characteristic of DMAs is offline sales. To produce balance in this characteristic, DMAs were first rank-ordered by sales volume. Pairs of regions were then randomly assigned to treatment/control. DMAs provide units that can be easily supplied with distinct offerings, although this fine-grained split was not a requirement for the model. In fact, we carried out the analysis as if only one treatment region had been available (formed by summing all treated DMAs). This allowed us to evaluate whether our approach would yield the same results as more conventional treatment-control comparisons would have done.
The outcome variable analysed here was search-related visits to the advertiser’s website, consisting of organic clicks (i.e., clicks on a search result) and paid clicks (i.e., clicks on an ad next to the search results, for which the advertiser was charged). Since paid clicks were zero before the campaign, one might wonder why we could not simply count the number of paid clicks after the campaign had started. The reason is that paid clicks tend to cannibalize some organic clicks. Since we were interested in the net effect, we worked with the total number of clicks.

The first building block of the model used for the analyses in this section was a local level component. For the inverse-Gamma prior on its diffusion variance we used a prior estimate of $s/\nu = 0.1\sigma_y$ and a prior sample size $\nu = 32$. The second structural block was a static regression component. We used a spike-and-slab prior with an expected model size of $M = 3$, an expected explained variance of $R^2 = 0.8$ and 50 prior df. We deliberately kept the model as simple as this. Since the covariates came from a randomised experiment, we expected them to already account for any additional local linear trends and seasonal variation in the response variable. If one suspects that a more complex model might be more appropriate, one could optimize model design through Bayesian model selection. Here, we focus instead on comparing different sets of covariates, which is critical in counterfactual analyses regardless of the particular model structure used. Model estimation was carried out using 10 000 MCMC samples.

Analysis 1: Effect on the treated, using a randomised control. We began by applying the above model to infer the causal effect of the campaign on the time series of clicks in the treated regions.
Given that a set of unaffected regions was available in this analysis, the best possible set of controls was given by the untreated DMAs themselves (see below for a comparison with a purely observational alternative).

As shown in Figure 5a, the model provided an excellent fit on the pre-campaign trajectory of clicks (including a spike in ‘week −2’ and a dip at the end of ‘week −1’). Following the onset of the campaign, observations quickly began to diverge from counterfactual predictions: the actual number of clicks was consistently higher than what would have been expected in the absence of the campaign. The curves did not reconvene until one week after the end of the campaign. Subtracting predicted from observed data, as we did in Figure 5b, resulted in a posterior estimate of the incremental lift caused by the campaign. It peaked about three weeks into the campaign, and faded away about one week after the end of the campaign. Thus, as shown in Figure 5c, the campaign led to a sustained cumulative increase in total clicks (as opposed to a mere shift of future clicks into the present, or a pure cannibalization of organic clicks by paid clicks). Specifically, the overall effect amounted to 88 400 additional clicks in the targeted regions (posterior expectation; rounded to three significant digits), i.e., an increase of 22%, with a central 95% credible interval of [13%, 30%].

To validate this estimate, we returned to the original experimental data, on which a conventional treatment-control comparison had been carried out using a two-stage linear model (Vaver and Koehler, 2011). This analysis had led to an estimated lift of 84 700 clicks, with a 95% confidence interval for the relative expected lift of [19%, 22%]. Thus, with a deviation of less than 5%, the counterfactual approach had led to almost precisely the same estimate as the randomised evaluation, except for its wider intervals.
The latter is expected, given that our intervals represent prediction intervals, not confidence intervals. Moreover, in addition to an interval for the sum over all time points, our approach yields a full time series of pointwise intervals, which allows analysts to examine the characteristics of the temporal evolution of attributable impact.

The posterior predictive intervals in Figure 5b widen more slowly than in the illustrative example in Figure 1. This is because the large number of controls available in this data set offers a much higher pre-campaign predictive strength than in the simulated data in Figure 1. This is not unexpected, given that controls came from a randomised experiment, and we will see that this also holds for a subsequent analysis (see below) that is based on yet another data source for predictors. A consequence of this is that there is little variation left to be captured by the random-walk component of the model. A reassuring finding is that the estimated counterfactual time series in Figure 5a eventually almost exactly rejoins the observed series, only a few days after the end of the intervention.

[Figure 5. Panels: (a) US clicks, with model fit and prediction; (b) pointwise impact; (c) cumulative impact. Horizontal axis: week −4 to week 7, covering the pre-intervention, intervention, and post-intervention periods.]

Figure 5. Causal effect of online advertising on clicks in treated regions. (a) Time series of search-related visits to the advertiser’s website (including both organic and paid clicks). (b) Pointwise (daily) incremental impact of the campaign on clicks. Shaded vertical bars indicate weekends. (c) Cumulative impact of the campaign on clicks.

Analysis 2: Effect on the treated, using observational controls.
An important characteristic of counterfactual-forecasting approaches is that they do not require a setting in which a set of controls, selected at random, was exempt from the campaign. We therefore repeated the preceding analysis in the following way: we discarded the data from all control regions and, instead, used searches for keywords related to the advertiser’s industry, grouped into a handful of verticals, as covariates. In the absence of a dedicated set of control regions, such industry-related time series can be very powerful controls as they capture not only seasonal variations but also market-specific trends and events (though not necessarily advertiser-specific trends). A major strength of the controls chosen here is that time series on web searches are publicly available through Google Trends (http://www.google.com/trends/). This makes the approach applicable to virtually any kind of intervention. At the same time, the industry as a whole is unlikely to be moved by a single actor’s activities. This precludes a positive bias in estimating the effect of the campaign that would arise if a covariate was negatively affected by the campaign. As shown in Figure 6, we found a cumulative lift of 85 900 clicks (posterior expectation), or 21%, with a [12%, 30%] interval. In other words, the analysis replicated almost perfectly the original analysis that had access to a randomised set of controls. One feature in the response variable which this second analysis failed to account for was a spike in clicks in the second week before the campaign onset; this spike appeared both in treated and untreated regions and appears to be specific to this advertiser. In addition, the series of point-wise impact (Figure 6b) is slightly more volatile than in the original analysis (Figure 5). On the other hand, the overall point estimate of 85 900, in this case, was even closer to the randomised-design baseline (84 700; deviation ca. 
1%) than in our first analysis (88 400; deviation ca. 4%). In summary, the counterfactual approach effectively obviated the need for the original randomised experiment. Using purely observational variables led to the same substantive conclusions.

[Figure 6. Panels: (a) US clicks, with model fit and prediction; (b) pointwise impact; (c) cumulative impact. Horizontal axis: week −4 to week 7, covering the pre-intervention, intervention, and post-intervention periods.]

Figure 6. Causal effect of online advertising on clicks, using only searches for keywords related to the advertiser’s industry as controls, discarding the original control regions as would be the case in studies where a randomised experiment was not carried out. (a) Time series of clicks to the advertiser’s website. (b) Pointwise (daily) incremental impact of the campaign on clicks. (c) Cumulative impact of the campaign on clicks. The plots show that this analysis, which was based on observational covariates only, provided almost exactly the same inferences as the first analysis (Figure 5) that had been based on a randomised design.

Analysis 3: Absence of an effect on the controls. To go one step further still, we analysed clicks in those regions that had been exempt from the advertising campaign. If the effect of the campaign was truly specific to treated regions, there should be no effect in the controls. To test this, we inferred the causal effect of the campaign on unaffected regions, which should not lead to a significant finding. In analogy with our second analysis, we discarded clicks in the treated regions and used searches for keywords related to the advertiser’s industry as controls.

As summarized in Figure 7, no significant effect was found in unaffected regions, as expected. Specifically, we obtained an overall non-significant lift of 2% in clicks with a central 95% credible interval of [−6%, 10%].
In summary, the empirical data considered in this section showed: (i) a clear effect of advertising on treated regions when using randomised control regions to form the regression component, replicating previous treatment-control comparisons (Figure 5); (ii) notably, an equivalent finding when discarding control regions and instead using observational searches for keywords related to the advertiser’s industry as covariates (Figure 6); (iii) reassuringly, the absence of an effect of advertising on regions that were not targeted (Figure 7).

5. Discussion. The increasing interest in evaluating the incremental impact of market interventions has been reflected by a growing literature on applied causal inference. With the present paper we are hoping to contribute to this literature by proposing a Bayesian state-space model for obtaining a counterfactual prediction of market activity. We discuss the main features of this model below.

In contrast to most previous schemes, the approach described here is fully Bayesian, with regularizing or empirical priors for all hyperparameters. Posterior inference gives rise to complete-data (smoothing) predictions that are only conditioned on past data in the treatment market and both past and present data in the control markets. Thus, our model embraces a dynamic evolution of states and, optionally, coefficients (departing from classical linear regression models with a fixed number of static regressors) and enables us to flexibly summarize posterior inferences.

Because closed-form posteriors for our model do not exist, we suggest a stochastic approximation to inference using MCMC. One convenient consequence of this is that we can reuse the samples from the posterior to obtain credible intervals for all summary statistics of interest.
Such statistics include, for example, the average absolute and relative effect caused by the intervention as well as its cumulative effect.

[Figure 7. Panels: (a) US clicks, with model fit and prediction; (b) pointwise impact; (c) cumulative impact. Horizontal axis: week −4 to week 7, covering the pre-intervention, intervention, and post-intervention periods.]

Figure 7. Causal effect of online advertising on clicks in non-treated regions, which should not show an effect. Searches for keywords related to the advertiser’s industry are used as controls. Plots show inferences in analogy with Figure 5. (a) Time series of clicks to the advertiser’s website. (b) Pointwise (daily) incremental impact of the campaign on clicks. (c) Cumulative impact of the campaign on clicks.

Posterior inference was implemented in C++ and R and, for all empirical datasets presented in Section 4, took less than 30 seconds on a standard Linux machine. If the computational burden of sampling-based inference ever became prohibitive, one option would be to replace it by a variational Bayesian approximation (see Mathys et al., 2011; Brodersen et al., 2013, for examples).

Another way of using the proposed model is for power analyses. In particular, given past time series of market activity, we can define a point in the past to represent a hypothetical intervention and apply the model in the usual fashion. As a result, we obtain a measure of uncertainty about the response in the treated market after the beginning of the hypothetical intervention. This provides an estimate of what incremental effect would have been required to be outside of the 95% central interval of what would have happened in the absence of treatment.
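One simple way to read the power-analysis idea above is sketched below. It assumes that posterior predictive draws of the cumulative counterfactual response over the hypothetical intervention period are already available from a fitted model; the function name, the made-up draws, and the choice to measure the required lift relative to the posterior mean are all assumptions of this sketch, not part of the original method description.

```python
import math


def minimum_detectable_lift(cumulative_counterfactual_draws, level=0.95):
    """Smallest positive lift that would push the observed total above the
    upper bound of the central predictive interval.

    Returns (absolute lift, lift relative to the expected counterfactual).
    """
    xs = sorted(cumulative_counterfactual_draws)

    def quantile(p):
        pos = p * (len(xs) - 1)
        lo = math.floor(pos)
        hi = min(lo + 1, len(xs) - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    tail = (1.0 - level) / 2.0
    upper = quantile(1.0 - tail)
    expected = sum(xs) / len(xs)
    absolute = upper - expected
    return absolute, absolute / expected


# Made-up predictive draws of total activity in a placebo period.
draws = [float(v) for v in range(90, 111)]
absolute_lift, relative_lift = minimum_detectable_lift(draws)
```

With these made-up draws, the observed total would have to exceed the expected counterfactual by roughly 9.5 units, or about 9.5%, before it left the 95% central interval. A mirrored calculation on the lower bound would give the corresponding threshold for a negative effect.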
The model presented here subsumes several simpler models which, in consequence, lack important characteristics, but which may serve as alternatives should the full model appear too complex for the data at hand. One example is classical multiple linear regression. In principle, classical regression models go beyond difference-in-differences schemes in that they account for the full counterfactual trajectory. However, they are not suited for predicting stochastic processes beyond a few steps. This is because ordinary least-squares estimators disregard serial autocorrelation; the static model structure does not allow for temporal variation in the coefficients; and predictions ignore our posterior uncertainty about the parameters. Put differently: classical multiple linear regression is a special case of the state-space model described here in which (i) the Gaussian random walk of the local level has zero variance; (ii) there is no local linear trend; (iii) regression coefficients are static rather than time-varying; (iv) ordinary least squares estimators are used which disregard posterior uncertainty about the parameters and may easily overfit the data.

Another special case of the counterfactual approach discussed in this paper is given by synthetic control estimators that are restricted to the class of convex combinations of predictor variables and do not include time-series effects such as trends and seasonality (Abadie, Diamond and Hainmueller, 2010; Abadie, 2005). Relaxing this restriction means we can utilize predictors regardless of their scale, even if they are negatively correlated with the outcome series of the treated unit.

Other special cases include autoregressive (AR) and moving-average (MA) models. These models define autocorrelation among observations rather than latent states, thus precluding the ability to distinguish between state noise and observation noise (Ataman, Mela and Van Heerde, 2008; Leeflang et al., 2009).
In the scenarios we consider, advertising is a planned perturbation of the market. This generally makes it easier to obtain plausible causal inferences than in genuinely observational studies in which the experimenter had no control over treatment (see discussions in Berndt, 1991; Brady, 2002; Hitchcock, 2004; Robinson, McNulty and Krasno, 2009; Winship and Morgan, 1999; Camillo and d’Attoma, 2010; Antonakis et al., 2010; Lewis and Reiley, 2011; Lewis, Rao and Reiley, 2011; Kleinberg and Hripcsak, 2011; Vaver and Koehler, 2011). The principal problem in observational studies is endogeneity: the possibility that the observed outcome might not be the result of the treatment but of other omitted, endogenous variables. In principle, propensity scores can be used to correct for the selection bias that arises when the treatment effect is correlated with the likelihood of being treated (Rubin and Waterman, 2006; Chan et al., 2010). However, the propensity-score approach requires that exposure can be measured at the individual level, and it, too, does not guarantee valid inferences, for example in the presence of a specific type of selection bias recently termed ‘activity bias’ (Lewis, Rao and Reiley, 2011). Counterfactual modelling approaches avoid these issues when it can be assumed that the treatment market was chosen at random.

Overall, we expect inferences on the causal impact of designed market interventions to play an increasingly prominent role in providing quantitative accounts of return on investment (Danaher and Rust, 1996; Seggie, Cavusgil and Phelan, 2007; Leeflang et al., 2009; Stewart, 2009). This is because marketing resources, specifically, can only be allocated to whichever campaign elements jointly provide the greatest return on ad spend (ROAS) if we understand the causal effects of spend on sales, product adoption, or user engagement. At the same time, our approach could be used for many other applications involving causal inference.
Examples include problems found in economics, epidemiology, biology, or the political and social sciences. With the release of the CausalImpact R package we hope to provide a simple framework serving all of these areas. Structural time-series models are being used in an increasing number of applications at Google, and we anticipate that they will prove equally useful in many analysis efforts elsewhere.

Acknowledgements. The authors wish to thank Jon Vaver for sharing the empirical data analysed in this paper.

References.

Abadie, A. (2005). Semiparametric Difference-in-Differences Estimators. The Review of Economic Studies 72 1–19.
Abadie, A., Diamond, A. and Hainmueller, J. (2010). Synthetic control methods for comparative case studies: Estimating the effect of California's tobacco control program. Journal of the American Statistical Association 105.
Abadie, A. and Gardeazabal, J. (2003). The economic costs of conflict: A case study of the Basque Country. American Economic Review 113–132.
Angrist, J. D. and Krueger, A. B. (1999). Empirical strategies in labor economics. Handbook of Labor Economics 3 1277–1366.
Angrist, J. D. and Pischke, J.-S. (2008). Mostly Harmless Econometrics: An Empiricist’s Companion. Princeton University Press.
Antonakis, J., Bendahan, S., Jacquart, P. and Lalive, R. (2010). On making causal claims: A review and recommendations. The Leadership Quarterly 21 1086–1120.
Ashenfelter, O. and Card, D. (1985). Using the longitudinal structure of earnings to estimate the effect of training programs. The Review of Economics and Statistics 648–660.
Ataman, M. B., Mela, C. F. and Van Heerde, H. J. (2008). Building brands. Marketing Science 27 1036–1054.
Athey, S. and Imbens, G. W. (2002). Identification and Inference in Nonlinear Difference-In-Differences Models. Working Paper No. 280, National Bureau of Economic Research.
Banerjee, S., Kauffman, R. J. and Wang, B. (2007).
Modeling Internet firm survival using Bayesian dynamic models with time-varying coefficients. Electronic Commerce Research and Applications 6 332–342.
Belloni, A., Chernozhukov, V., Fernandez-Val, I. and Hansen, C. (2013). Program evaluation with high-dimensional data. CeMMAP working papers No. CWP77/13, Centre for Microdata Methods and Practice, Institute for Fiscal Studies.
Berndt, E. R. (1991). The practice of econometrics: classic and contemporary. Addison-Wesley, Reading, MA.
Bertrand, M., Duflo, E. and Mullainathan, S. (2002). How Much Should We Trust Differences-in-Differences Estimates? Working Paper No. 8841, National Bureau of Economic Research.
Brady, H. E. (2002). Models of causal inference: Going beyond the Neyman-Rubin-Holland theory. In Annual Meetings of the Political Methodology Group.
Brodersen, K. H., Daunizeau, J., Mathys, C., Chumbley, J. R., Buhmann, J. M. and Stephan, K. E. (2013). Variational Bayesian mixed-effects inference for classification studies. NeuroImage 76 345–361.
Camillo, F. and d’Attoma, I. (2010). A new data mining approach to estimate causal effects of policy interventions. Expert Systems with Applications 37 171–181.
Campbell, D. T., Stanley, J. C. and Gage, N. L. (1963). Experimental and quasi-experimental designs for research. Houghton Mifflin, Boston.
Card, D. and Krueger, A. B. (1993). Minimum wages and employment: A case study of the fast food industry in New Jersey and Pennsylvania. Technical Report, National Bureau of Economic Research.
Carter, C. K. and Kohn, R. (1994). On Gibbs sampling for state space models. Biometrika 81 541–553.
Chan, D., Ge, R., Gershony, O., Hesterberg, T. and Lambert, D. (2010). Evaluating Online Ad Campaigns in a Pipeline: Causal Models at Scale. In Proceedings of ACM SIGKDD 2010 7–15.
Chipman, H., George, E. I., McCulloch, R. E., Clyde, M., Foster, D. P. and Stine, R. A. (2001). The practical implementation of Bayesian model selection.
Lecture Notes-Monograph Series 65–134.
Claveau, F. (2012). The Russo-Williamson Theses in the social sciences: Causal inference drawing on two types of evidence. Studies in History and Philosophy of Science Part C: Studies in History and Philosophy of Biological and Biomedical Sciences 0.
Cox, D. and Wermuth, N. (2001). Causal Inference and Statistical Fallacies. In International Encyclopedia of the Social & Behavioral Sciences (Editors-in-Chief: Neil J. Smelser and P. B. Baltes) 1554–1561. Pergamon, Oxford.
Danaher, P. J. and Rust, R. T. (1996). Determining the optimal return on investment for an advertising campaign. European Journal of Operational Research 95 511–521.
de Jong, P. and Shephard, N. (1995). The simulation smoother for time series models. Biometrika 82 339–350.
Donald, S. G. and Lang, K. (2007). Inference with Difference-in-Differences and Other Panel Data. Review of Economics and Statistics 89 221–233.
Durbin, J. and Koopman, S. J. (2002). A Simple and Efficient Simulation Smoother for State Space Time Series Analysis. Biometrika 89 603–616.
Frühwirth-Schnatter, S. (1994). Data augmentation and dynamic linear models. Journal of Time Series Analysis 15 183–202.
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association 88 881–889.
George, E. I. and McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica 7 339–374.
Ghosh, J. and Clyde, M. A. (2011). Rao-Blackwellization for Bayesian Variable Selection and Model Averaging in Linear and Binary Regression: A Novel Data Augmentation Approach. Journal of the American Statistical Association 106 1041–1052.
Hansen, C. B. (2007a). Asymptotic properties of a robust variance matrix estimator for panel data when T is large. Journal of Econometrics 141 597–620.
Hansen, C. B. (2007b). Generalized least squares inference in panel and multilevel models with serial correlation and fixed effects.
Journal of Econometrics 140 670–694.
Heckman, J. J. and Vytlacil, E. J. (2007). Econometric Evaluation of Social Programs, Part I: Causal Models, Structural Models and Econometric Policy Evaluation. In Handbook of Econometrics (J. J. Heckman and E. E. Leamer, eds.) 6, Part B 4779–4874. Elsevier.
Hitchcock, C. (2004). Do All and Only Causes Raise the Probabilities of Effects? In Causation and Counterfactuals. MIT Press.
Hoover, K. D. (2012). Economic Theory and Causal Inference. In Philosophy of Economics (U. Mäki, ed.) 13 89–113. Elsevier.
Kleinberg, S. and Hripcsak, G. (2011). A review of causal inference for biomedical informatics. Journal of Biomedical Informatics 44 1102–1112.
Leeflang, P. S., Bijmolt, T. H., van Doorn, J., Hanssens, D. M., van Heerde, H. J., Verhoef, P. C. and Wieringa, J. E. (2009). Creating lift versus building the base: Current trends in marketing dynamics. International Journal of Research in Marketing 26 13–20.
Lester, R. A. (1946). Shortcomings of marginal analysis for wage-employment problems. The American Economic Review 36 63–82.
Lewis, R. A., Rao, J. M. and Reiley, D. H. (2011). Here, there, and everywhere: correlated online behaviors can lead to overestimates of the effects of advertising. In Proceedings of the 20th International Conference on World Wide Web. WWW ’11 157–166. ACM, New York, NY, USA.
Lewis, R. A. and Reiley, D. H. (2011). Does Retail Advertising Work? Technical Report.
Liang, F., Paulo, R., Molina, G., Clyde, M. A. and Berger, J. O. (2008). Mixtures of g-priors for Bayesian variable selection. Journal of the American Statistical Association 103 410–423.
Mathys, C., Daunizeau, J., Friston, K. J. and Stephan, K. E. (2011). A Bayesian Foundation for Individual Learning Under Uncertainty. Frontiers in Human Neuroscience 5.
Meyer, B. D. (1995). Natural and Quasi-Experiments in Economics. Journal of Business & Economic Statistics 13 151.
Morgan, S. L. and Winship, C. (2007).
Counterfactuals and causal inference: Methods and principles for social research. Cambridge University Press.
Nakajima, J. and West, M. (2013). Bayesian analysis of latent threshold dynamic models. Journal of Business & Economic Statistics 31 151–164.
Polson, N. G. and Scott, S. L. (2011). Data augmentation for support vector machines. Bayesian Analysis 6 1–23.
Robinson, G., McNulty, J. E. and Krasno, J. S. (2009). Observing the Counterfactual? The Search for Political Experiments in Nature. Political Analysis 17 341–357.
Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66 688.
Rubin, D. B. (2007). Statistical Inference for Causal Effects, With Emphasis on Applications in Epidemiology and Medical Statistics. In Handbook of Statistics (J. M. C. R. Rao and D. Rao, eds.) 27 28–63. Elsevier.
Rubin, D. B. and Waterman, R. P. (2006). Estimating the Causal Effects of Marketing Interventions Using Propensity Score Methodology. Statistical Science 21 206–222.
Scott, J. G. and Berger, J. O. (2010). Bayes and empirical-Bayes multiplicity adjustment in the variable-selection problem. Annals of Statistics 38 2587–2619.
Scott, S. L. and Varian, H. R. (2013). Predicting the Present with Bayesian Structural Time Series. International Journal of Mathematical Modeling and Optimization (forthcoming).
Seggie, S. H., Cavusgil, E. and Phelan, S. E. (2007). Measurement of return on marketing investment: A conceptual framework and the future of marketing metrics. Industrial Marketing Management 36 834–841.
Shadish, W. R., Cook, T. D. and Campbell, D. T. (2002). Experimental and quasi-experimental designs for generalized causal inference. Wadsworth Cengage Learning.
Solon, G. (1984). Estimating autocorrelations in fixed-effects models. National Bureau of Economic Research, Cambridge, Mass., USA.
Stewart, D. W. (2009).
Marketing accountability: Linking marketing actions to financial results. Journal of Business Research 62 636–643.
Takada, H. and Bass, F. M. (1998). Multiple Time Series Analysis of Competitive Marketing Behavior. Journal of Business Research 43 97–107.
Vaver, J. and Koehler, J. (2011). Measuring Ad Effectiveness Using Geo Experiments. Technical Report, Google Inc.
Vaver, J. and Koehler, J. (2012). Periodic Measurement of Advertising Effectiveness Using Multiple-Test-Period Geo Experiments. Technical Report, Google Inc.
West, M. and Harrison, J. (1997). Bayesian Forecasting and Dynamic Models. Springer.
Winship, C. and Morgan, S. L. (1999). The estimation of causal effects from observational data. Annual Review of Sociology 659–706.
Zellner, A. (1986). On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti (P. K. Goel and A. Zellner, eds.) 233–243. North-Holland/Elsevier.

Google, Inc.
1600 Amphitheatre Parkway
Mountain View, CA 94043, U.S.A.

Estimating reach curves from one data point

Georg M. Goerg
Google Inc.
Last update: November 21, 2014

Abstract

Reach curves arise in advertising and media analysis as they relate the number of content impressions to the number of people who have seen the content. This is especially important for measuring the effectiveness of an ad on TV or websites (Nielsen, 2009; PricewaterhouseCoopers, 2010). For a mathematical and data-driven analysis, it would be very useful to know the entire reach curve; advertisers, however, often only know its last data point, i.e., the total number of impressions and the total reach. In this work I present a new method to estimate the entire curve using only this last data point. Furthermore, analytic derivations reveal a surprisingly simple, yet insightful relationship between marginal cost per reach, average cost per impression, and frequency.
Thus, advertisers can estimate the cost of an additional reach point by just knowing their total number of impressions, reach, and cost. A comparison of the proposed one-data-point method to two competing regression models on TV reach curve data shows that the proposed methodology performs only slightly worse than regression fits to a collection of several points along the curve.

1 Introduction

Let k+ reach, $r_k$, be the percentage of the population that is exposed to a campaign at least k times. As usual, we measure impressions in gross rating points (GRPs), calculated as the number of impressions divided by the total (target) population, multiplied by 100 (measured in percent). Equipped with a functional form of the reach curve, a variety of quantities of interest can be computed, e.g., marginal cost per reach or maximum possible reach. Advertisers, however, often only have two points of the reach curve $r_k(g)$:

$$r_k(0) = 0 \quad \text{and} \quad r_k(G) = R \in [0, 100], \qquad (1)$$

where $G \ge 0$ is the total GRPs and $R$ is total reach. With this information alone one is tempted to use a linear approximation $r_k^{(1)}(g) = \frac{R}{G}\, g$. However, reach curves are not linear and in particular, the marginal reach per GRP would equal average reach per GRP (= 1/frequency); thus (1) alone is not helpful to get a better estimate of marginal GRP (and thus cost) per reach at $g = G$. While the behavior of $r_k(g)$ around $g = G$ is in general unknown, the tangent at $g = 0$ can be approximated quite well: starting with no exposure, adding an infinitesimally small unit of GRPs (say $\epsilon$) one reaches $\epsilon \cdot \iota$ % of the population, where $\iota = \iota(k)$ is the reciprocal of the expected number of impressions needed for the first person to see k impressions. One can lower bound $\iota$ by $1/k$. For $k = 1$, the bound is tight, $\iota = 1$; getting an exact expression of $\iota$ for $k > 1$ is ongoing research.¹ That is, for small $g$ the reach curve can be approximated with a line through (0, 0) with slope $\iota$: $r_k(g) \approx g \cdot \iota$ for small $g$.
(2)

Thus, approximately,

$$\lim_{G \to 0} \frac{\partial}{\partial g} r_k(g = G) = \iota. \qquad (3)$$

Combining (1) with (3) allows us to estimate a two-parameter model. Section 2 reviews parametric models for reach curves. Section 3 derives the parameter estimates based on the total GRP and reach. Simulations and comparisons to full least squares estimates are presented in Section 4. Finally, Section 5 summarizes the main findings and discusses future work. Details on the TV reach curve data and analytical derivations can be found in the Appendix.

¹ In practice we found that $\iota = (k + \log_2 k)^{-1}$ gives good fits for several $k \ge 1$.

2 Reach curve models

Let $X \ge 0$ be the number of content impressions, e.g., TV shows, websites, or commercials. For a probabilistic view of reach curves, it is useful to decompose k+ reach as

$$P(X \ge k, \text{reachable}) = P(X \ge k \mid \text{reachable}) \cdot P(\text{reachable}) \qquad (4)$$

$$\Leftrightarrow \quad r_k = p_k \cdot \rho, \qquad (5)$$

where $\rho$ is the maximum possible reach, and $p_k$ is the probability of being reached k times, given that an individual is indeed reachable. This distinction allows us to model $\rho$ and $p_k$ with separate probabilistic models. Since reach is usually denoted in percent, we also use percent for maximum possible reach $\rho \in [0, 100]$, while we use proportions for $p_k \in [0, 1]$. For further analytical derivations it is necessary to parametrize $p_k(g)$. Below we review two functional forms which are parsimonious (2 + 1 parameters), have excellent empirical fits, and lend themselves to simple analytical derivations.

2.1 Gamma-Mixture

Jin et al. (2012) propose a Poisson distribution for the impressions $g$, with an exponential prior distribution with rate $\beta$ on the Poisson rate $\lambda$. This yields a model of the form

$$p_k(g) = 1 - \frac{\beta}{g + \beta}. \qquad (6)$$

The exponential prior can be generalized to a $\Gamma(\alpha, \beta)$ distribution, which yields

$$r_k(g) = \rho \left( 1 - \left( \frac{\beta}{\beta + g} \right)^{\alpha} \right). \qquad (7)$$

By construction, (6) is nested in (7), which can be tested using a hypothesis test for $H_0: \alpha = 1$.
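The nesting of (6) in (7) is easy to check numerically. The sketch below is an illustration only; the function names `gamma_mixture_reach` and `exponential_reach` are hypothetical, and the parameter values are arbitrary rather than estimates from any data.

```python
def gamma_mixture_reach(g, rho, beta, alpha=1.0):
    """Reach (in percent) under the Gamma-mixture model of Eq. (7)."""
    return rho * (1.0 - (beta / (beta + g)) ** alpha)


def exponential_reach(g, rho, beta):
    """Reach under the exponential-prior model, rho times Eq. (6)."""
    return rho * (1.0 - beta / (g + beta))


# Arbitrary illustrative values: maximum reach 80%, saturation beta = 150.
grps = [0.0, 10.0, 100.0, 500.0, 5000.0]
nested = [gamma_mixture_reach(g, rho=80.0, beta=150.0, alpha=1.0) for g in grps]
simple = [exponential_reach(g, rho=80.0, beta=150.0) for g in grps]
```

With alpha fixed at 1 the two lists agree, the curve starts at zero, increases in g, and stays below the maximum possible reach rho.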
2.1.1 Marginal reach

The derivative of (7) with respect to $g$ equals²

$$\frac{\partial}{\partial g} p_k(g) = \frac{\alpha}{\beta} \left( \frac{\beta}{g + \beta} \right)^{\alpha + 1}, \qquad (8)$$

with

$$\lim_{g \to 0} \frac{\partial}{\partial g} r_k(g) = \frac{\rho \alpha}{\beta}. \qquad (9)$$

Eq. (9) has three degrees of freedom; since only two data points are available, one parameter has to be fixed. Given the nested structure of the exponential model, it is natural to set $\alpha \equiv 1$.

² See Section B.1 for details.

2.2 Conditional Logit

As an alternative we propose a logistic regression

$$\operatorname{logit}(p_k(g)) = \beta_0 + \beta_1 \cdot \log g, \qquad (10)$$

where $\operatorname{logit}(p) = \log \frac{p}{1 - p}$, and $\beta_0$ and $\beta_1$ are intercept and slope.³ Using the logit inverse $\operatorname{expit}(x) = \frac{e^x}{1 + e^x} = \frac{1}{1 + e^{-x}}$, Eq. (10) can be rewritten as

$$p_k = \operatorname{expit}(\beta_0 + \beta_1 \log g) = \frac{e^{\beta_0 + \beta_1 \log g}}{1 + e^{\beta_0 + \beta_1 \log g}} \qquad (11)$$

$$= 1 - \frac{1}{1 + e^{\beta_0} \cdot g^{\beta_1}} \qquad (12)$$

$$= 1 - \frac{e^{-\beta_0}}{e^{-\beta_0} + g^{\beta_1}}, \qquad (13)$$

which shows similarity to (7). In fact, identifying $\beta \equiv e^{-\beta_0}$, both models coincide if $\alpha = 1$ and $\beta_1 = 1$, respectively. Again, this can be tested using a two-sided hypothesis test for $H_0: \beta_1 = 1$. The conditional logit model can also be interpreted as the baseline Gamma mixture model with $\alpha \equiv 1$, but with transformed GRPs, $\tilde{g} = g^{\beta_1}$, in (7). Here $\beta_1$ can be interpreted as a parameter that measures the efficiency of GRPs: for $\beta_1 > 1$ GRPs are more efficient than baseline; for $\beta_1 = 1$ GRPs are spent according to the baseline model; and for $\beta_1 < 1$ GRPs are not spent as efficiently as expected. For empirical estimates see Section 4.

³ We deliberately do not use $\alpha$ and $\beta$ to parametrize intercept and slope, as it is prone to confusion with the (reversed) roles of $\alpha$ and $\beta$ in (8).

[…] 1. (15)

Thus for the logit model one has to assume $\beta_1 = 1$ to use the linear approximation of $R(g)$ at $g = 0$ for 1+ reach.⁴

3 Methodology

Equipped with the two parameter model

$$r(g; \rho, \beta) = \rho \left( 1 - \frac{\beta}{\beta + g} \right) = \rho\, \frac{g}{\beta + g} \in [0, \rho], \qquad (16)$$

we can use the tangent approximation in (3) and total GRP and reach to estimate $\rho$ and $\beta$.
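The estimation idea just stated can be sketched in a few lines. This is a hedged illustration rather than the author's code: it solves the two conditions on model (16), namely slope iota at g = 0 and passage through the observed point (G, R), and caps the maximum reach at 100 percent as in (20) and (21). The function names are hypothetical, and treating the case G <= R / iota as the capped case is an assumption made here so that the sketch never divides by zero.

```python
def fit_one_point_reach_curve(G, R, iota=1.0, rho_max=100.0):
    """Estimate (rho, beta) of r(g) = rho * g / (beta + g) from one point.

    G is total GRPs, R is total reach in percent, iota is the assumed
    slope of the reach curve at g = 0.
    """
    if G <= 0 or not (0 < R < rho_max):
        raise ValueError("need G > 0 and 0 < R < rho_max")
    denom = G - R / iota
    # Assumption: a non-positive denominator is handled as the capped case.
    rho = G * R / denom if denom > 0 else float("inf")
    if rho < rho_max:
        return rho, rho / iota
    # Capped case: rho is fixed at rho_max and beta is chosen so that the
    # curve still passes through (G, R); the slope at 0 is then not iota.
    return rho_max, G * (rho_max - R) / R


def reach(g, rho, beta):
    """Two-parameter reach curve of Eq. (16)."""
    return rho * g / (beta + g)


rho_a, beta_a = fit_one_point_reach_curve(G=200.0, R=50.0)  # uncapped
rho_b, beta_b = fit_one_point_reach_curve(G=60.0, R=50.0)   # capped at 100
```

In both cases the fitted curve reproduces the observed point, i.e. `reach(G, rho, beta)` returns R up to rounding.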
Note that $\beta \ge 0$ is a saturation parameter and controls how efficient GRPs are: for small $\beta$ reach grows quickly with GRPs, for large $\beta$ it grows slowly. The derivative of (16) with respect to $g$ equals

$$r'(g; \rho, \beta) = \rho\, \frac{\beta}{(\beta + g)^2}, \qquad (17)$$

which at $g = 0$ evaluates to $r'(0) = \frac{\rho}{\beta}$. This gives a system of two equations (maximum GRP and reach & marginal reach at 0) with two unknowns, $\rho \in [0, 100]$ and $\beta > 0$:

$$\frac{\rho}{\beta} = \iota \quad \Leftrightarrow \quad \rho = \beta \cdot \iota, \qquad (18)$$

$$\rho\, \frac{G}{\beta + G} = R \quad \Leftrightarrow \quad \rho = \frac{R (G + \beta)}{G}. \qquad (19)$$

First note that for 1+ reach, $\rho \equiv \beta$ since $\iota(k = 1) = 1$. Moreover, $\rho$ in (19) satisfies $\rho \ge 0$ for all $\beta$, but it satisfies $\rho \le 100$ only for $\beta \le G \cdot \frac{100 - R}{R}$.

⁴ For $k > 1$, the Logit model with $\beta_1 > 1$ might become useful as the marginal k+ reach for the very first impression is 0. However, one then has to estimate three parameters again, which is not possible without any further assumptions or more than one data point.

Solving for $\beta$ and plugging in to $\rho = \rho(\beta)$ gives

$$\hat{\rho} = \min\left( \frac{G \cdot R}{G - R/\iota},\; 100 \right), \qquad (20)$$

and

$$\hat{\beta} = \begin{cases} \dfrac{\hat{\rho}}{\iota} = \dfrac{G \cdot R/\iota}{G - R/\iota}, & \text{if } \hat{\rho} < 100, \\[2ex] G \cdot \dfrac{100 - R}{R}, & \text{if } \hat{\rho} = 100. \end{cases} \qquad (21)$$

Rearranging (20), and assuming $G > R/\iota$, the condition $\frac{G \cdot R}{G - R/\iota} \le 100$ is equivalent to $G \ge \frac{100}{\iota} \cdot \frac{R}{100 - R}$; thus GRPs must be greater than or equal to a constant times the odds of reach. Plugging the estimates back into (16) yields expressions for reach solely as a function of $R$ and $G$ (for details see Appendix B). According to (21) we consider the two scenarios separately.

3.1 ρ […]

[…] $1 - F\theta/p_z > 1 - F$. The first inequality will be close to an equality when $p_z$ and hence $\theta$ is small. For our applications $1 - F\theta/p_z$ is a reasonable approximation to the variance ratio. The second inequality reflects the fact that pooling the data cannot possibly be better than what we would get with an SSP of size $n + N$. From $\widetilde{\mathrm{var}}(\hat{\theta}_I)/\mathrm{var}(\hat{\theta}_S) \approx 1 - F\theta/p_z$ we see that using the BRP is effectively like multiplying the SSP sample size $n$ by $1/(1 - F\theta/p_z)$. Our greatest precision gains come when a high fraction of online reaches are incremental, that is, when $\theta/p_z$ is largest.
In our application this proportion ranges from 20% to 50% when aggregated to the campaign level. See Table 2.1 in Section 2.

3.2 Gain from the CIA alone

Here we evaluate the variance reduction that would follow from the CIA. In that case, we could take advantage of the Z–Y independence, and estimate $\theta$ by $\hat{\theta}_C = \bar{Z}_S (1 - \bar{Y}_S)$. It is shown in the Appendix that the delta method variance of $\hat{\theta}_C$ satisfies

$$\frac{\widetilde{\mathrm{var}}(\hat{\theta}_C)}{\mathrm{var}(\hat{\theta}_S)} = 1 - \frac{p_y (1 - p_z)}{1 - \theta} > 1 - p_y, \qquad (2)$$

when the CIA holds. This can represent a dramatic improvement when the online reach $p_z$ and incremental reach $\theta$ are both small while the TV reach $p_y$ is large. If the CIA holds, our application data suggest the variance reduction can be from 50% to 80%. The reverse setting with tiny TV reach and large online reach would not be favorable to $\hat{\theta}_C$, but our data are not of that type.

3.3 Gain from the CIA and IDA

Finally, suppose that both the CIA and IDA hold. If we apply both assumptions, we can get the estimator $\hat{\theta}_{I,C} = (f \bar{Z}_S + F \bar{Z}_B)(1 - \bar{Y}_S)$. We already gain a lot from the CIA, so it is interesting to see how much more the IDA adds when the CIA holds. We show in the Appendix that under both assumptions,

$$\frac{\widetilde{\mathrm{var}}(\hat{\theta}_{I,C})}{\widetilde{\mathrm{var}}(\hat{\theta}_C)} = \frac{f (1 - p_y)(1 - p_z) + p_y p_z}{(1 - p_y)(1 - p_z) + p_y p_z}.$$

If both reaches are high then we gain little, but if both reaches are small then we reduce the variance by almost a factor of $f$, when adding the IDA to the CIA. In our case we expect that the television reach is large but the online reach is small, fitting neither of these extremes. Consider a campaign with $f = 1/3$, $p_y = 2/3$ and $p_z = 1/100$, similar to the soap campaigns. For such a campaign,

$$\frac{\widetilde{\mathrm{var}}(\hat{\theta}_{I,C})}{\widetilde{\mathrm{var}}(\hat{\theta}_C)} = \frac{(1/9) \times 0.99 + (2/3) \times 0.01}{(1/3) \times 0.99 + (2/3) \times 0.01} \approx 0.35,$$

so the combined assumptions then allow a nearly three-fold variance reduction compared to CIA alone.

4 Example campaigns

Our data enrichment scheme is described in Section 5.
Here we illustrate the results from that scheme on six marketing campaigns and discuss the differences among different algorithms. In addition to data enrichment, we also show results from tree structured models. Those split the data into groups and recursively split the groups. More about tree fitting is in Section 5. One model fits a tree to the SSP data alone and another one works with the pooled SSP and BRP data. For all three of those methods we have aggregated the predictions over the age variable, which takes six levels. In addition, we show the empirical results for age, which amount to recording the percentage of incremental reaches, that is, data with Z(1 − Y) = 1, at each unique level of age in the SSP. There is no corresponding empirical prediction fully disaggregated by age, gender, income and education, because of the great many empty cells that would cause.

We found the age related patterns of incremental reach particularly interesting. Figure 4.1 shows estimated incremental reach for all three models and the empirical counts, on all six campaigns, averaged over age groups. The beer campaign is particularly telling. The empirical data show a decreasing trend of incremental reach with age. The tree fit to SSP-only data yields a fit that is constant in age. The tree model had to explore splitting the data on all four variables without a prior focus on age. There were only 23 incremental reach events for beer in the SSP data set. With such a small number of events and four predictors, there is considerable possibility of overfitting. Cross-validation led to a model that grouped the entire SSP into one set, that is, the tree had no splits. Both pooling and data enrichment were able to borrow strength from the BRP as well as take advantage of approximate independence of television and web exposure.
They then recover the trend with age.

[Figure 4.1: six panels (Beer, Chrome, Salt, Soap 1, Soap 2, Soap 3); x-axis: age level (1 to 6); y-axis: incremental reach (%); series: Emp, SSP, Pool, DEIR.]

Fig. 4.1: Estimated incremental reach by age, for six campaigns and three models: SSP, Pooling and DEIR as described in the text. Empirical counts are marked by Emp.

The Salt campaign had a similarly small number of incremental reaches and once again the SSP only tree was constant. Fitting a tree to the SSP data always gave a flatter fit versus age than did DEIR which in turn was flatter than what we would get simply pooling the data. Section 6 gives simulations in which DEIR has greater accuracy than using pooling or SSP only.

5 Data enrichment for incremental reach

For a given sample we would like to combine incremental reach estimates $\hat{\theta}_S$, $\hat{\theta}_I$, $\hat{\theta}_C$ and $\hat{\theta}_{I,C}$ whose assumptions are: none, IDA, CIA and IDA+CIA, respectively. The latter three add some value if their corresponding assumptions are nearly true, but our information about how well those assumptions hold comes from the same data we are using to form the estimates.

The circumstances are similar to those in data enriched linear regression (Chen et al., 2013). In that problem there is a regression model $Y_i = X_i^T \beta + \varepsilon_i$ which holds in the SSP and a biased regression model $Y_i = X_i^T (\beta + \gamma) + \varepsilon_i$ holds in the BRP. The estimates are found by minimizing

$$S(\lambda) = \sum_{i \in S} (Y_i - X_i^T \beta)^2 + \sum_{i \in B} \big(Y_i - X_i^T (\beta + \gamma)\big)^2 + \lambda \sum_{i \in S} (X_i^T \gamma)^2, \qquad (3)$$

over $\beta$ and $\gamma$ for a nonnegative penalty factor $\lambda$. The $\varepsilon_i$ are independent with mean 0 and variance $\sigma_S^2$ in the SSP and $\sigma_B^2$ in the BRP.
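For intuition, criterion (3) can be minimized in closed form in the intercept-only case where every $X_i = 1$. The sketch below is an illustration under that assumption, not the procedure of Chen et al. (2013); the function names and the toy data are hypothetical. With n SSP and N BRP observations, profiling out gamma leaves a weighted mean of the two sample means in which the BRP mean carries weight N * lam * n / (N + lam * n).

```python
import math


def enriched_objective(y_ssp, y_brp, beta, gamma, lam):
    """Criterion (3) specialised to intercept-only models (X_i = 1)."""
    n = len(y_ssp)
    fit_s = sum((y - beta) ** 2 for y in y_ssp)
    fit_b = sum((y - beta - gamma) ** 2 for y in y_brp)
    return fit_s + fit_b + lam * n * gamma ** 2


def enriched_intercept(y_ssp, y_brp, lam):
    """Closed-form minimiser (beta, gamma) of the intercept-only criterion."""
    n, N = len(y_ssp), len(y_brp)
    mean_s = sum(y_ssp) / n
    mean_b = sum(y_brp) / N
    # Effective weight of the BRP mean; it tends to N as lam grows.
    w = float(N) if math.isinf(lam) else N * lam * n / (N + lam * n)
    beta = (n * mean_s + w * mean_b) / (n + w)
    gamma = 0.0 if math.isinf(lam) else N * (mean_b - beta) / (N + lam * n)
    return beta, gamma


# Toy binary data, for illustration only.
ssp = [1, 0, 0, 1]
brp = [1, 1, 1, 0, 1, 1, 1, 1]
beta_sep, gamma_sep = enriched_intercept(ssp, brp, lam=0.0)
beta_mid, gamma_mid = enriched_intercept(ssp, brp, lam=1.0)
beta_pool, gamma_pool = enriched_intercept(ssp, brp, lam=math.inf)
```

At lam = 0 the estimate equals the SSP mean, at lam = infinity it equals the pooled mean, and intermediate values of lam fall in between.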
Taking $\lambda = 0$ amounts to fitting regressions separately in the two samples yielding an estimate $\hat{\beta}$ that does not use the BRP at all. The limit $\lambda \to \infty$ corresponds to pooling the two data sets, which would be optimal if there were no bias, i.e., if $\gamma = 0$. The specific penalty in (3) discourages the estimated $\gamma$ from making large changes to the SSP; it is one of several penalties considered in that paper. Varying $\lambda$ from 0 to $\infty$ gives a family of estimators that weight the SSP to varying degrees.

The optimal $\lambda$ is unknown. An oracle that knew $\gamma$ and the error variance in the two data sets would be able to compute the optimal $\lambda$ under a mean squared error loss. Chen et al. (2013) get a formula for the oracle’s $\lambda$ and then plug estimates of $\gamma$ and the variances into that formula. They show, under conditions, that the resulting plug-in estimate gives better estimates of $\beta$ than using the SSP only would. The conditions are that the $Y$ values are normally distributed, and that the model has at least 5 regression parameters and 10 error degrees of freedom. The normality assumption allows a technical lemma due to Stein (1981) to be used and we believe that gains from using the BRP do not require normality.

In principle we might multiply the sum of squared errors in the BRP by $\tau = \sigma_S^2 / \sigma_B^2$ if that ratio is known. If $\sigma_{\mathrm{BRP}}^2 > \sigma_{\mathrm{SSP}}^2$ then we should put less weight on the BRP sample relative to the SSP sample. However the same effect is gained by increasing $\lambda$. Since the algorithm searches for optimal $\lambda$ over a wide range it is less important to precisely specify $\tau$. Chen et al. (2013) took $\tau = 1$, simply summing all squared errors, and we will generalize that approach.

For the present setting we must modify the method. First, our responses are binary, not Gaussian. Second, we have four estimators to combine, not two. Third, those estimators are dependent, being fit to overlapping data sets.

5.1 Modification for binary response

To address the binary response there are two reasonable choices.
One is to employ logistic regression. The other is to use tree-structured regression and then pool the estimators at the leaves of the tree. Regarding prediction accuracy, there is no unique best algorithm. There will be data sets for which simple logistic regression outperforms tree based classifiers and vice versa. For this paper we have adopted trees. Tree structured models have two practical advantages. First, the resulting cells that they select correspond to empirically determined market segments, which are then interpretable. Second, within any of those cells, the model is intercept-only. Then both logistic regression and least squares reduce to a simple average.

| Data set | Source | Imputed V | Assumptions |
|---|---|---|---|
| $D_0$ | SSP | $Z_S (1 - Y_S)$ | none |
| $D_1$ | BRP | $Z_B \big(1 - \hat{Y}_{\mathrm{SSP}}(X_B, Z_B)\big)$ | IDA |
| $D_2$ | SSP | $\hat{Z}_{\mathrm{SSP}}(X_S) \big(1 - \hat{Y}_{\mathrm{SSP}}(X_S)\big)$ | CIA |
| $D_3$ | SSP | $\hat{Z}_{\mathrm{SSP+BRP}}(X_S) \big(1 - \hat{Y}_{\mathrm{SSP}}(X_S)\big)$ | CIA & IDA |

Tab. 5.1: Four incremental reach data sets and their imputed incremental reaches. The hats denote model-imputed values. For example $\hat{Y}_{\mathrm{SSP}}(X_B, Z_B)$ is a predictive model for $Y$ based on values $X$ and $Z$ fit using data from SSP and evaluated at $X = X_B$ and $Z = Z_B$ (from BRP).

Each leaf of the regression tree defines a subset of the data that we call a cell. There are cells $1, \dots, C$. The SSP has $n_c$ observations in cell $c$ and the BRP has $N_c$ observations there. For each cell and each set of assumptions we use a linear regression model relating an incremental reach quantity like $\tilde{V}_i$ to an intercept. When there are no assumptions then $\tilde{V}_i$ is the observed incremental reach for $i \in S$. Otherwise we may take advantage of the assumptions to impute values $\tilde{V}_i$ using more of the data. The incremental reach values for each set of assumptions are given in Table 5.1. The predictive models shown there are all fit using rpart. For $k = 0, 1, 2, 3$ let $\tilde{V}_k$ be the vector of imputed responses under any of the assumptions from Table 5.1 and $\tilde{X}_k$ their corresponding predictors.
The regression framework minimizes

$$\|\tilde{V}_0 - \tilde{X}_0 \beta\|^2 + \sum_{k=1}^{3} \|\tilde{V}_k - \tilde{X}_k (\beta + \gamma_k)\|^2 + \sum_{k=1}^{3} \lambda_k \|\tilde{X}_0 \gamma_k\|^2 \qquad (4)$$

over $\beta$ and $\gamma_k$ for penalties $\lambda_k$. In our setting each $\tilde{X}_k$ is a column vector of ones of length $m_k$. For cell $c$, $m_1 = N_c$ and $m_0 = m_2 = m_3 = n_c$.

5.2 Search for $\lambda_k$

It is very convenient to search for suitable weights in the simplex

$$\Delta^{(K)} = \Big\{ (\omega_0, \omega_1, \dots, \omega_K) \;\Big|\; \omega_k \ge 0, \; \sum_{k=0}^{K} \omega_k = 1 \Big\}$$

because it is a bounded set, unlike the set $[0, \infty]^K$ of usable vectors $\lambda = (\lambda_1, \dots, \lambda_K)$. Chen et al. (2013) remark that it is more reasonable to use a common set of $\lambda_k$ over all cells, stemming from unequal sample sizes. The search we use combines the advantages of both approaches.

Our search strategy for the simplex is to choose a grid of weight vectors $\omega_g = (\omega_{g0}, \omega_{g1}, \dots, \omega_{gK}) \in \Delta^{(K)}$, $g = 1, \dots, G$. For each vector $\omega_g$ we find a vector $\lambda_g = (\lambda_1, \dots, \lambda_K)$ such that

$$\sum_{c=1}^{C} p_c\, \omega_{k,c} = \omega_{gk}, \quad k = 0, 1, \dots, K,$$

where $p_c$ is the proportion of our target population in cell $c$. That is, the population average weight of $\omega_{k,c}$ matches $\omega_{gk}$. These weights give us the vector $\lambda_g = (\lambda_{g1}, \dots, \lambda_{gK})$. Using $\lambda_g$ in the penalty criterion (4) specifies the weights we use within each cell. Our algorithm chooses the tree and the vector $\omega$ jointly using cross-validation.

It is computationally expensive to make high dimensional searches. With $K$ factors there is a $K - 1$ dimensional space of weights to search. Adding in the tree size gives a $K$'th dimension. As a result, combining all of our estimators requires us to search a 4 dimensional grid of values. We have chosen to set one of the $\omega_k$ to 0 to reduce the search space from 4 dimensions to 3. We always retained the unbiased estimate $\hat{\theta}_S$ along with two others. In some computations reported in section A.4 of the Appendix we find only small differences among setting $\omega_1 = 0$, or $\omega_2 = 0$ or $\omega_3 = 0$. The best outcome was setting $\omega_1 = 0$. That has the effect of removing the estimate based on IDA only.
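The reduced weight grid with $\omega_1 = 0$ can be enumerated directly. The sketch below is a hedged illustration with a hypothetical function name `weight_grid`. It uses the default settings stated in Section 5.2: weights that are integer multiples of 10%, and the requirement that the unbiased data set D0 receives at least as much weight as D2 or D3. The constraint is read here as non-strict, which is the reading that yields the 24 combinations the text reports.

```python
def weight_grid(steps=10):
    """Enumerate (omega_0, omega_2, omega_3) on a simplex grid.

    Weights are multiples of 1/steps, sum to one, and satisfy
    omega_0 >= max(omega_2, omega_3). omega_1 is fixed at zero and omitted.
    """
    grid = []
    for a in range(steps + 1):           # omega_0 in units of 1/steps
        for b in range(steps + 1 - a):   # omega_2 in units of 1/steps
            c = steps - a - b            # omega_3 takes the remainder
            if a >= max(b, c):
                grid.append((a / steps, b / steps, c / steps))
    return grid


grid = weight_grid()
n_points = len(grid)
```

Each grid point is one candidate weight vector to be scored by cross-validation together with the tree size.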
As we saw in section 3, the IDA-only model had the least potential to improve our estimate. As a bonus, all three of the retained submodels have the same sample sizes, so a common $\lambda$ over cells coincides with a common $\omega$ over cells.

In the special case with $\omega_1 = 0$ we find after some calculus that the minimizer of (4) has

$$\hat\beta_c = \frac{\bar V_{0c} + \sum_{k \in \{2,3\}} \frac{\lambda_k}{1+\lambda_k}\,\bar V_{kc}}{1 + \sum_{k \in \{2,3\}} \frac{\lambda_k}{1+\lambda_k}} \equiv \sum_{k \in \{0,2,3\}} \omega_{kc}(\lambda)\,\bar V_{kc} \qquad (5)$$

where $\bar V_{kc}$ is the simple average of $\tilde V_k$ over $i \in S$ for cell $c$.

Our default grid takes all values of $\omega$ whose coefficients are integer multiples of 10%. Data sets D0, D2 and D3 all have the same sample size $n$ and of these only D0 is surely unbiased. An observation in D0 is worth at least as much as an observation in D2 or D3 and so we require $\omega_0 \ge \max\{\omega_2, \omega_3\}$. Figure 5.1 shows this region and the set of 24 weight combinations that we use.

5.3 Search for tree size

Here we give a brief review of regression trees in order to define our algorithm. For a full description see the monograph by Breiman et al. (1985). The version we use is the function rpart (Therneau and Atkinson, 1997) in the R programming language (R Core Team, 2012).

[Figure 5.1: two panels, titled "Weight region" and "Weight points".]
Fig. 5.1: The left panel shows the simplex of weights applied to data sets D0, D2 and D3 with the unbiased data set D0 in the lower left. The shaded region has the valid weights. The right panel shows that region with points for the 24 weights we use in our algorithm.

Regression trees are built from splits of the set of subjects. A split uses one of the features in $X$ and creates two subsets based on the values of that feature. For example it might split males from females or it might split those with the two smallest education levels from the others.
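As a minimal sketch of a single split, the snippet below partitions a handful of records on one feature, once by sex and once by the two smallest education levels. The field names (`sex`, `education`, `v`) and the toy records are hypothetical and are not taken from the paper's data; rpart's own split search is not reproduced here.

```python
def split(rows, feature, left_values):
    """Partition rows into two subsets by whether row[feature] is in left_values."""
    left = [r for r in rows if r[feature] in left_values]
    right = [r for r in rows if r[feature] not in left_values]
    return left, right


# Hypothetical subjects; "v" plays the role of an incremental reach value.
rows = [
    {"sex": "F", "education": 1, "v": 0.0},
    {"sex": "M", "education": 2, "v": 1.0},
    {"sex": "F", "education": 3, "v": 0.0},
    {"sex": "M", "education": 4, "v": 1.0},
    {"sex": "F", "education": 2, "v": 1.0},
]

# Split those with the two smallest education levels from the others.
lowest_two = set(sorted({r["education"] for r in rows})[:2])
low, high = split(rows, "education", lowest_two)

# Split females from males.
females, males = split(rows, "sex", {"F"})

print(len(low), len(high), len(females), len(males))
```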
Such a split defines two subpopulations of our target population and it equally defines two subsamples of our sample. A regression tree is a recursively defined set of splits. After the subjects are split into two groups based on one variable, each of those two groups may then be split again, using the same or different variables. Recursive splitting yields a tree structure with subsets of subjects in the leaf nodes. Given a tree, we predict for subjects by a rule based on the leaf to which they belong. That rule uses the average within the subject's leaf node.

The tree is found by a greedy search that minimizes a measure of prediction error. In our case, the measure $R(T)$ is the sum of squared prediction errors. By construction, adding splits to $T$ can only lower the error on the data used to fit it, and this brings a risk of overfitting. To counter overfitting, rpart adds a penalty proportional to the number $|T|$ of leaves in tree $T$. The penalized criterion is $R(T) + \alpha|T|$ where the parameter $\alpha > 0$ is chosen by $M$-fold cross-validation. This reduces the potentially complicated problem of choosing a tree to the simpler problem of selecting a scalar penalty parameter $\alpha$.

The rpart function has one option that we have changed from the default. That parameter is cp, the complexity parameter. The default is $10^{-2}$. The cp parameter stops tree growing early if a proposed split improves $R(T)$ by less than a factor of cp. We set cp $= 10^{-4}$. Our choice creates somewhat larger trees to get more choices to use in cross-validation.

5.4 The algorithm

Here is a summary of the entire algorithm. First we make the following preprocessing steps.

1) Fit a large tree $T$ by rpart relating observed incremental reaches $V_i$ to predictor variables $X_i$ in the SSP data. This tree returns a nested sequence of subtrees $T_0 \subset T_1 \subset \cdots \subset T_L \subset T$. Each $T_\ell$ corresponds to a critical value $\alpha_\ell$ of the penalty. Choosing $\alpha_\ell$ from this list selects the tree $T_\ell$.
The value $L$ is data-dependent and chosen by rpart.

2) Specify a grid of values $\omega_g$ for $g = 1, \dots, G$. Here $\omega_g = (\omega_{g0}, \omega_{g1}, \dots, \omega_{gK})$ with $\omega_{gk} \ge 0$ and $\sum_{k=0}^{K} \omega_{gk} = 1$.

3) Randomly partition the SSP data $(X_i, Y_i, Z_i)$ into $M$ folds $S_m$ for $m = 1, \dots, M$, each of roughly equal size $n/M$. For fold $m$ the SSP will contain $\cup S_{m'}$ for all $m' \ne m$. We call this $S_{-m}$. The BRP for fold $m$ is the entire BRP. We also considered using a bootstrap sample for the fold $m$ BRP, but that was more expensive and less accurate in our numerical investigation, as described in section A.4 of the Appendix.

After this precomputation, our algorithm proceeds to the cross-validation shown in Figure 5.2 to make a joint selection of the tree penalty parameter $\alpha_\ell$ and the simplex grid point $\omega_g$. Let the chosen values be $\alpha^*$ and $\omega^*$. We select the tree $T^*$ from step 1 above, corresponding to penalty parameter $\alpha^*$. We treat each leaf node of $T^*$ as a cell $c$. We translate $\omega^*$ into the corresponding $\lambda_c$ in every cell $c$ of tree $T^*$. Then we minimize (4) using this $\lambda_c$, and the resulting $\hat\beta_c$ is our estimate $\hat V_c$ of incremental reach in cell $c$. After choosing the tuning parameters $\omega_g$ and $\alpha_\ell$ by cross-validation, we use these parameters on the whole data set to make our final prediction.

6 Numerical investigation

In order to measure the effect of data enriched estimates on incremental reach, we conducted a simulation where we knew the ground truth. Our goal is to predict for ensembles, not for individuals, so we constructed two large populations in which ground truth was known to us, simulated our process of subsampling them, and scored predictions against the ground truth incremental reach probabilities.

To make our large samples realistic, we built them from our real data. We created S- and B-populations by replicating our SSP (respectively BRP) records 100 times each.
Then in each simulation, we form an SSP by drawing 6000 observations at random from the S-population, and a BRP by drawing 13,000 observations at random from the B-population.

    for ℓ = 1, ..., L do                          // initialize error sum of squares
        for g = 1, ..., G do
            SSE_{ℓ,g} ← 0
    for m = 1, ..., M do                          // folds
        construct Table 5.1 for fold m, using S_{−m} and B
        fit tree T_m for fold m by rpart
        prune tree T_m to T_{1,m}, ..., T_{L,m}; tree T_{ℓ,m} uses α_ℓ
        for ℓ = 1, ..., L do                      // tree sizes
            define cells S_{−m,c} and B_c, c = 1, ..., C from leaves of T_{ℓ,m}
            for g = 1, ..., G do                  // simplex weights
                convert ω_g into λ_g
                for c = 1, ..., C do              // cells
                    compute Ṽ_k for k = 0, 2, 3 in cell c
                    get V̂_c = β̂_c from the weighted average (5)
                    V_c ← (1/|S_{m,c}|) Σ_{i ∈ S_{m,c}} V_i     // held out incr. reach
                    p_c ← fraction of true S population in cell c
                    SSE_{ℓ,g} ← SSE_{ℓ,g} + p_c (V̂_c − V_c)²

Fig. 5.2: Data enrichment for incremental reach (deir) algorithm. After the precomputation described in Section 5.4 we run this cross-validation algorithm to choose the complexity parameter $\alpha_\ell$ and the weights $\omega_g$, as the joint minimizers $\ell^*$ and $g^*$ of SSE$_{\ell,g}$. The values $p_c$ come from a census or from the SSP if the census does not have the variables we need. We use $M = 10$.

For each campaign, we apply deir with this sample data to estimate the incremental reach $\hat V(x)$. We used 10-fold cross-validation. The mean square estimation error (MSE) is $\sum_x p(x)(\hat V(x) - V(x))^2$. This sum is taken over all $x$ values in the SSP. The simulation above was repeated 1000 times. The root mean square error was divided by the true incremental reach to get a relative RMSE.

We consider two comparison methods. The first is to use the SSP only. That method computes $\hat\theta_S$ within the leaves of a tree. The tree is found by rpart. The second comparison is a tree fit by rpart to the pooled SSP and BRP data and using both CIA and IDA. We do not compare to the empirical fractions because many of them are from empty cells.
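The weight-related steps of Figure 5.2 can be sketched as follows. The code enumerates a grid of (ω0, ω2, ω3) in multiples of 10% with ω0 at least as large as both ω2 and ω3 (ties allowed, which is the reading that reproduces the 24 points reported for the default grid), converts a grid point into penalties, and evaluates the weighted average (5). The conversion λk = ωk/(ω0 − ωk) is not stated in the paper; it is obtained here by inverting (5) in the equal-sample-size case with ω1 = 0, and all function names are hypothetical.

```python
import math


def weight_grid(steps=10):
    """Grid points (w0, w2, w3) in multiples of 1/steps that sum to 1
    and satisfy w0 >= max(w2, w3)."""
    grid = []
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            c = steps - a - b
            if a >= max(b, c):
                grid.append((a / steps, b / steps, c / steps))
    return grid


def omega_to_lambda(w0, wk):
    """Penalty implied by a weight ratio, from w_k / w_0 = lambda / (1 + lambda).
    Equal weights correspond to an infinite penalty (full pooling)."""
    if wk >= w0:
        return math.inf
    return wk / (w0 - wk)


def shrink_factor(lam):
    """lambda / (1 + lambda), with the limit 1 at lambda = infinity."""
    return 1.0 if math.isinf(lam) else lam / (1.0 + lam)


def cell_estimate(vbar0, vbar2, vbar3, lam2, lam3):
    """Weighted average (5) of the three cell means."""
    r2, r3 = shrink_factor(lam2), shrink_factor(lam3)
    return (vbar0 + r2 * vbar2 + r3 * vbar3) / (1.0 + r2 + r3)


def weighted_sse(p, v_hat, v_holdout):
    """Population-weighted squared error summed over cells."""
    return sum(pc * (vh - v) ** 2 for pc, vh, v in zip(p, v_hat, v_holdout))


grid = weight_grid()
print(len(grid))
```

In this special case, plugging the converted penalties back into (5) returns the original simplex weights, so searching over ω and searching over λ describe the same family of estimates.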
Figure 6.1 compares the relative errors in the SSP only method to data enrichment. Data enrichment is consistently better over all 6 campaigns we simulated in the great majority of replications. It is clear that the populations are similar enough that using the larger data set improves estimation of incremental reach.

[Figure 6.1: six scatter-plot panels (Beer, Chrome, Salt, Soap 1, Soap 2, Soap 3); axes are DEIR RMSE (%) and SSP RMSE (%).]
Fig. 6.1: Performance comparison, SSP only versus data enrichment, predictive relative mean square errors. There is one panel for each of 6 campaigns with one point for each of 1000 replicates. The reference line is the forty-five degree line.

Under the IDA we can pool the SSP and BRP together using rpart on the combined data to estimate $\Pr(Z = 1 \mid X)$. Under the CIA we can multiply this estimate by $\Pr(Y = 0 \mid X)$ fit by rpart to the SSP; see Table 5.1 under the assumption CIA & IDA. This method, as an implementation of statistical matching, uses two separate applications of rpart, each with its own built-in cross-validation.

Figure 6.2 compares the relative errors of statistical matching to data enrichment. Data enrichment is consistently better over all 6 campaigns we simulated in the great majority of replications.

We also investigate, for each estimator, how much of the predictive error is contributed by bias. It is well known that predictive mean square error can be decomposed as the sum of variance and squared bias.
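The decomposition can be checked numerically. The sketch below assumes a small matrix of replicate estimates (one row per simulation replicate, one column per cell), known true values and population weights; all names and numbers are hypothetical. It uses the population variance over replicates, for which the weighted identity mse = bias² + var holds up to rounding.

```python
from statistics import fmean, pvariance


def bias2_and_variance(estimates, truth, weights):
    """Population-weighted squared bias and variance across replicates.

    estimates[r][j] is the estimate for cell j in replicate r.
    """
    bias2 = 0.0
    var = 0.0
    for j, (t, w) in enumerate(zip(truth, weights)):
        column = [rep[j] for rep in estimates]
        bias2 += w * (fmean(column) - t) ** 2
        var += w * pvariance(column)
    return bias2, var


def weighted_mse(estimates, truth, weights):
    """Average over replicates of sum_j w_j * (estimate_j - truth_j)^2."""
    per_replicate = [
        sum(w * (e - t) ** 2 for w, e, t in zip(weights, rep, truth))
        for rep in estimates
    ]
    return fmean(per_replicate)


# Hypothetical values: 3 replicates, 2 cells (not results from the paper).
estimates = [[0.10, 0.30], [0.14, 0.26], [0.12, 0.34]]
truth = [0.10, 0.30]
weights = [0.5, 0.5]

bias2, var = bias2_and_variance(estimates, truth, weights)
mse = weighted_mse(estimates, truth, weights)
print(bias2 / mse)  # fraction of the error due to bias
```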
These quantities are typically unknown in practice, but can be evaluated in simulation studies.6 Numerical investigation 16 DEIR RMSE (%) Pool RMSE (%) 0.6 0.8 1.0 1.2 1 2 3 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Beer 5 6 7 8 9 10 3 4 5 6 7 8 9 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Chrome 0.6 0.8 1.0 1.2 1.4 0.5 1.0 1.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Salt 0.5 1.0 1.5 0 1 2 3 4 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Soap 1 0.8 1.0 1.2 1.4 1.6 1.8 0.5 1.0 1.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 
● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● Soap 2 0.4 0.6 0.8 1.0 1.2 0.0 0.5 1.0 1.5 ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ●● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● ● 
[Figure 6.2, final panel: Soap 3.]

Fig. 6.2: Performance comparison, statistical matching (data pooling) versus data enrichment, predictive relative mean square errors. There is one panel for each of 6 campaigns with one point for each of 1000 replicates. The reference line is the forty-five degree line.

Table 6.1 reports the fractions of squared bias in predictive mean square errors for each method in all six studies. We see there that the error for statistical matching (data pooling) is dominated by bias while the error for SSP only is dominated by variance. These results are not surprising because the SSP only method has no sampling bias (only algorithmic bias) while the pooled data set has maximal sampling bias. The proportion of bias for DEIR is in between these extremes. Here we have less population bias than a typical data fusion situation because the TV and online-only panels were recruited in the same way. The bottom of Table 6.1 shows that DEIR is able to trade off bias and variance more effectively than SSP only or data pooling, because DEIR attains the smallest predictive mean squared error.

bias^2/mse   Beer   Chrome   Salt   Soap 1   Soap 2   Soap 3
SSP          0.35   0.42     0.26   0.12     0.28     0.12
Pool         0.88   0.82     0.88   0.88     0.88     0.93
DEIR         0.49   0.59     0.47   0.33     0.47     0.39

mse          Beer   Chrome   Salt   Soap 1   Soap 2   Soap 3
SSP          1.02   7.76     0.89   0.84     1.26     0.66
Pool         0.82   7.39     0.80   0.86     1.12     0.78
DEIR         0.61   5.42     0.48   0.52     0.68     0.42

Tab. 6.1: The upper rows show the fraction bias^2/mse of the mean squared prediction error due to bias for 3 methods to estimate incremental reach in 6 campaigns. The lower rows show the total mse, that is bias^2 + var.

Conclusions

Predictions of incremental reach can be improved by making use of additional data. That improvement comes only if certain strong assumptions are true or at least approximately true. Our only guide to the accuracy of those assumptions may come from the data themselves. Our data enriched incremental reach estimate uses a shrinkage strategy to pool estimates using different assumptions. Cross-validating the level of pooling gave us an algorithm that worked better than either ignoring the additional data or treating it the same as the unbiased data.

Acknowledgment

This project was not part of Art Owen's Stanford responsibilities. His participation was done as a consultant at Google. The authors would like to thank Penny Chu, Tony Fagan, Yijia Feng, Jerome Friedman, Yuxue Jin, Daniel Meyer, Jeffrey Oldham and Hal Varian for support and constructive comments.

References

Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1985). Classification and Regression Trees. Chapman & Hall/CRC, Boca Raton, FL.
Chen, A., Owen, A. B., and Shi, M. (2013). Data enriched linear regression. Technical report, Google. http://arxiv.org/abs/1304.1837.
Collins, J. and Doe, P. (2009). Developing an integrated television, print and consumer behavior database from national media and purchasing currency data sources. In Worldwide Readership Symposium, Valencia.
Doe, P. and Kudon, D. (2010). Data integration in practice: connecting currency and proprietary data to understand media use. ARF Audience Measurement 5.0.
D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory and Practice. Wiley, Chichester, UK.
Gilula, Z., McCulloch, R. E., and Rossi, P. E. (2006). A direct approach to data fusion. Journal of Marketing Research, XLIII:73–83.
Jin, Y., Shobowale, S., Koehler, J., and Case, H. (2012). The incremental reach and cost efficiency of online video ads over TV ads. Technical report, Google.
Lehmann, E. L. and Romano, J. P. (2005). Testing Statistical Hypotheses. Springer, New York, third edition.
Little, R. J. A. and Rubin, D. B. (2009). Statistical Analysis with Missing Data. John Wiley & Sons Inc., Hoboken, NJ, 2nd edition.
R Core Team (2012). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.
Rässler, S. (2004). Data fusion: identification problems, validity, and multiple imputation. Austrian Journal of Statistics, 33(1&2):153–171.
Singh, A. C., Mantel, H., Kinack, M., and Rowe, G. (1993). Statistical matching: Use of auxiliary information as an alternative to the conditional independence assumption. Survey Methodology, 19:59–79.
Stein, C. M. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 197–206.
Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151.
The Nielsen Company (2011). The cross-platform report. Quarter 2, U.S.
Therneau, T. M. and Atkinson, E. J. (1997). An introduction to recursive partitioning using the RPART routines. Technical Report 61, Mayo Clinic.

A Appendix

A.1 Variance reduction by IDA

Recall that $f = n/(n+N)$ and $F = N/(n+N)$ are sample size proportions of the two data sets. Under the IDA we may estimate incremental reach by

$$\hat\theta_I = (f\bar Z_S + F\bar Z_B)\,\frac{\bar V_S}{\bar Z_S} = \bar V_S\left(f + F\,\frac{\bar Z_B}{\bar Z_S}\right).$$

By the delta method (Lehmann and Romano, 2005), $\mathrm{var}(\hat\theta_I)$ is approximately

$$\widetilde{\mathrm{var}}(\hat\theta_I) = \mathrm{var}(\bar V_S)\left(\frac{\partial\hat\theta_I}{\partial\bar V_S}\right)^2 + \mathrm{var}(\bar Z_B)\left(\frac{\partial\hat\theta_I}{\partial\bar Z_B}\right)^2 + \mathrm{var}(\bar Z_S)\left(\frac{\partial\hat\theta_I}{\partial\bar Z_S}\right)^2 + 2\,\mathrm{cov}(\bar V_S,\bar Z_S)\,\frac{\partial\hat\theta_I}{\partial\bar V_S}\,\frac{\partial\hat\theta_I}{\partial\bar Z_S},$$

with partial derivatives evaluated with expectations $E(\bar V_S)$, $E(\bar Z_S)$, and $E(\bar Z_B)$ replacing the corresponding random quantities. The other two covariances are zero because the S and B samples are independent.
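As an informal check on the estimator above, the following Python sketch compares the empirical variance of $\hat\theta_I$ with that of the S-only estimate $\hat\theta_S = \bar V_S$ by simulation. The sampling model is an assumption made only for illustration: $Z_i$ is Bernoulli($p_z$), and $V_i$ can equal 1 only when $Z_i = 1$, with $P(V_i = 1) = \theta$. The function names, parameter values and seed are likewise hypothetical.

```python
import random
from statistics import pvariance


def simulate_once(rng, theta, p_z, n, N):
    """Draw one S sample (size n) and one B sample (size N).

    Assumed toy model, not taken from the paper's data:
    Z_i ~ Bernoulli(p_z), and V_i = 1 requires Z_i = 1, with P(V_i = 1) = theta.
    """
    v_sum = z_sum = 0
    for _ in range(n):
        z = rng.random() < p_z
        v = z and (rng.random() < theta / p_z)
        z_sum += z
        v_sum += v
    zb_sum = sum(rng.random() < p_z for _ in range(N))

    v_bar, zs_bar, zb_bar = v_sum / n, z_sum / n, zb_sum / N
    f, F = n / (n + N), N / (n + N)
    theta_s = v_bar                                        # uses the S sample only
    theta_i = (f * zs_bar + F * zb_bar) * v_bar / zs_bar   # IDA estimate
    return theta_s, theta_i


def variance_ratio(theta=0.2, p_z=0.3, n=100, N=1000, reps=1000, seed=0):
    """Empirical var(theta_I) / var(theta_S) over `reps` simulated data sets."""
    rng = random.Random(seed)
    draws = [simulate_once(rng, theta, p_z, n, N) for _ in range(reps)]
    var_s = pvariance([d[0] for d in draws])
    var_i = pvariance([d[1] for d in draws])
    return var_i / var_s


if __name__ == "__main__":
    print(variance_ratio())
```

Under these assumed parameters the ratio should come out well below 1. For comparison, the closed form derived in the remainder of this section, $1 - F\,\frac{1-p_z}{p_z}\,\frac{\theta}{1-\theta}$, evaluates to roughly 0.47 here. The simulated value will differ somewhat, because the delta method is only an approximation and the simulation is noisy.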
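From the binomial distribution we have $\mathrm{var}(\bar V_S) = \theta(1-\theta)/n$, $\mathrm{var}(\bar Z_B) = p_z(1-p_z)/N$ and $\mathrm{var}(\bar Z_S) = p_z(1-p_z)/n$. Also

$$\mathrm{cov}(\bar V_S,\bar Z_S) = \frac{1}{n}\bigl(E(V_iZ_i) - E(V_i)E(Z_i)\bigr) = \theta(1-p_z)/n.$$

After some calculus,

$$\begin{aligned}
\widetilde{\mathrm{var}}(\hat\theta_I)
&= \frac{\theta(1-\theta)}{n} + \frac{p_z(1-p_z)}{N}\,\frac{\theta^2F^2}{p_z^2} + \frac{p_z(1-p_z)}{n}\,\frac{\theta^2F^2}{p_z^2} - 2\,\frac{\theta(1-p_z)}{n}\,\frac{\theta F}{p_z}\\
&= \mathrm{var}(\hat\theta_S) + \frac{\theta^2F(1-p_z)}{p_z}\left(\frac{F}{N} + \frac{F}{n} - \frac{2}{n}\right)\\
&= \mathrm{var}(\hat\theta_S) - \frac{\theta^2F(1-p_z)}{p_z}\,\frac{1}{n}\\
&= \mathrm{var}(\hat\theta_S)\left(1 - F\,\frac{1-p_z}{p_z}\,\frac{\theta}{1-\theta}\right).
\end{aligned}$$

The third line uses $F/N + F/n = 1/n$.

A.2 Variance reduction by CIA

Applying the delta method to $\hat\theta_C = \bar Z_S(1-\bar Y_S)$, we find that

$$\begin{aligned}
\widetilde{\mathrm{var}}(\hat\theta_C)
&= \mathrm{var}(\bar Z_S)\left(\frac{\partial\hat\theta_C}{\partial\bar Z_S}\right)^2 + \mathrm{var}(\bar Y_S)\left(\frac{\partial\hat\theta_C}{\partial\bar Y_S}\right)^2 + 2\,\mathrm{cov}(\bar Y_S,\bar Z_S)\,\frac{\partial\hat\theta_C}{\partial\bar Y_S}\,\frac{\partial\hat\theta_C}{\partial\bar Z_S}\\
&= \mathrm{var}(\bar Z_S)(1-p_y)^2 + \mathrm{var}(\bar Y_S)\,p_z^2 - 2\,\mathrm{cov}(\bar Y_S,\bar Z_S)(1-p_y)\,p_z.
\end{aligned}$$

Here $\mathrm{var}(\bar Z_S) = p_z(1-p_z)/n$, $\mathrm{var}(\bar Y_S) = p_y(1-p_y)/n$, and under conditional independence $\mathrm{cov}(\bar Y_S,\bar Z_S) = 0$. Thus

$$\widetilde{\mathrm{var}}(\hat\theta_C) = \frac{1}{n}\Bigl(p_z(1-p_z)(1-p_y)^2 + p_y(1-p_y)\,p_z^2\Bigr) = \frac{p_z(1-p_y)}{n}\Bigl((1-p_z)(1-p_y) + p_yp_z\Bigr).$$

When the CIA holds, $\theta = p_z(1-p_y)$. Note that $\mathrm{var}(\hat\theta_S) = \theta(1-\theta)/n$. After some algebraic simplification we find that

$$\frac{\widetilde{\mathrm{var}}(\hat\theta_C)}{\mathrm{var}(\hat\theta_S)} = 1 - \frac{p_y(1-p_z)}{1-\theta}.$$

A.3 Variance reduction by CIA and IDA

When both assumptions hold we can estimate $\theta$ by $\hat\theta_{I,C} = (f\bar Z_S + F\bar Z_B)(1-\bar Y_S)$. Under these assumptions, $\bar Z_S$, $\bar Z_B$ and $\bar Y_S$ are all independent, and $\widetilde{\mathrm{var}}(\hat\theta_{I,C})$ equals

$$\begin{aligned}
&\mathrm{var}(\bar Z_S)\left(\frac{\partial\hat\theta_{I,C}}{\partial\bar Z_S}\right)^2 + \mathrm{var}(\bar Z_B)\left(\frac{\partial\hat\theta_{I,C}}{\partial\bar Z_B}\right)^2 + \mathrm{var}(\bar Y_S)\left(\frac{\partial\hat\theta_{I,C}}{\partial\bar Y_S}\right)^2\\
&\quad= \frac{p_z(1-p_z)}{n}\,f^2(1-p_y)^2 + \frac{p_z(1-p_z)}{N}\,F^2(1-p_y)^2 + \frac{p_y(1-p_y)}{n}\,p_z^2\\
&\quad= \frac{p_z(1-p_y)}{n}\Bigl(f(1-p_y)(1-p_z) + p_yp_z\Bigr)
\end{aligned}$$

after some simplification. As a result

$$\frac{\widetilde{\mathrm{var}}(\hat\theta_{I,C})}{\widetilde{\mathrm{var}}(\hat\theta_C)} = \frac{f(1-p_y)(1-p_z) + p_yp_z}{(1-p_y)(1-p_z) + p_yp_z}.$$

A.4 Alternative algorithms

We faced some design choices in our algorithm. First, we had to decide which estimators to include in our algorithm. We always include the unbiased choice $\hat\theta_S$ as well as two others. Second, we had to decide whether to use the entire BRP or to bootstrap sample it.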
We ran all six choices on simulations of all six data sets where we knew the correct answer. Table A.1 shows the mean squared errors for the six possible estimators on each of the six data sets. In every case we divided the mean squared error by that for the estimator combining $\hat\theta_S$, $\hat\theta_C$, and $\hat\theta_{I,C}$ without the bootstrap. We only see small differences, but the evidence favors choosing $\lambda_I = 0$ as well as not bootstrapping. Our default method is consistently the best in this table, although only by a small amount. We saw that data enrichment is consistently better than either pooling the data or ignoring the large sample, and by much larger amounts than we see in Table A.1. As a result, any of the data enrichment methods in this table would make a big improvement over either pooling the samples or ignoring the BRP.

Estimators   θ̂_S, θ̂_I, θ̂_C    θ̂_S, θ̂_I, θ̂_{I,C}    θ̂_S, θ̂_C, θ̂_{I,C}
BRP          All     Boot      All     Boot        All     Boot
Beer         1.02    1.02      1.00    1.01        1       1.01
Chrome       1.04    1.04      1.01    1.01        1       1.00
Salt         1.04    1.04      1.01    1.01        1       1.01
Soap 1       1.04    1.05      1.01    1.02        1       1.00
Soap 2       1.05    1.05      1.01    1.03        1       1.01
Soap 3       1.02    1.02      1.01    1.00        1       1.00

Tab. A.1: Relative performance of our estimators on six problems. The relative errors are mean squared prediction errors normalized to the case that uses $\hat\theta_S$, $\hat\theta_C$, $\hat\theta_{I,C}$ without bootstrapping. The relative error for that case is 1 by definition.

Coupled and k-Sided Placements: Generalizing Generalized Assignment

Madhukar Korupolu (1), Adam Meyerson (1), Rajmohan Rajaraman (2), and Brian Tagiku (1)
(1) Google, 1600 Amphitheater Parkway, Mountain View, CA. Email: {mkar,awmeyerson,btagiku}@google.com
(2) Northeastern University, Boston, MA 02115. Email: rraj@ccs.neu.edu

Abstract. In modern data centers and cloud computing systems, jobs often require resources distributed across nodes providing a wide variety of services.
Motivated by this, we study the Coupled Placement problem, in which we place jobs into computation and storage nodes with capacity constraints, so as to optimize some costs or profits associated with the placement. The coupled placement problem is a natural generalization of the widely-studied generalized assignment problem (GAP), which concerns the placement of jobs into single nodes providing one kind of service. We also study a further generalization, the k-Sided Placement problem, in which we place jobs into k-tuples of nodes, each node in a tuple offering one of k services. For both the coupled and k-sided placement problems, we consider minimization and maximization versions. In the minimization versions (MinCP and MinkSP), the goal is to achieve minimum placement cost, while incurring a minimum blowup in the capacity of the individual nodes. Our first main result is an algorithm for MinkSP that achieves optimal cost while increasing capacities by at most a factor of k + 1, also yielding the first constant-factor approximation for MinCP. In the maximization versions (MaxCP and MaxkSP), the goal is to maximize the total weight of the jobs that are placed under hard capacity constraints. MaxkSP can be expressed as a k-column sparse integer program, and can be approximated to within a factor of $O(k)$ using randomized rounding of a linear program relaxation. We consider alternative combinatorial algorithms that are much more efficient in practice. Our second main result is a local search based approximation algorithm that yields a 15-approximation for MaxCP and an $O(k^3)$-approximation for MaxkSP. Finally, we consider an online version of MaxkSP and present algorithms that achieve logarithmic competitive ratio under certain necessary technical assumptions.

1 Introduction

The data center has become one of the most important assets of a modern business.
Whether it is a private data center for exclusive use or a shared public cloud data center, the size and scale of the data center continues to rise. As a company grows, so too must its data center to accommodate growing computational, storage and networking demand. However, the new components purchased for this expansion need not be the same as the components already in place. Over time, the data center becomes quite heterogeneous [1]. This complicates the problem of placing jobs within the data center so as to maximize performance.

Jobs often require resources of more than one type: for example, compute and storage. Modern data centers typically separate computation from storage and interconnect the two using a network of switches. As such, when placing a job within a data center, we must decide which computation node and which storage node will serve the job. If we pick nodes that are far apart, then communication latency may become prohibitive. On the other hand, nodes are capacitated, so picking nodes close together may not always be possible.

Most prior work in data center resource management is focussed on placing one type of resource at a time: e.g., placing storage requirements assuming job compute location is fixed [2, 3] or placing compute requirements assuming job storage location is fixed [4, 5]. One-sided placement methods cannot suitably take advantage of the proximities and heterogeneities that exist in modern data centers. For example, a database analytics application requiring high throughput between its compute and storage elements can benefit by being placed on a storage node that has a nearby available compute node.

In this paper, we study Coupled Placement (CP), which is the problem of placing jobs into computation and storage nodes with capacity constraints, so as to optimize costs or profits associated with the placement.
Coupled placement was first addressed in [6] in a setting where we are required to place all jobs and we wish to minimize the communication latency over all jobs. They show that this problem, which we call MinCP, is NP-hard and investigate the performance of heuristic solutions. Another natural formulation is where the goal is to maximize the total number of jobs or revenue generated by the placement, subject to capacity constraints. We refer to this problem as MaxCP. We also study a generalization of Coupled Placement, the k-Sided Placement Problem (kSP), which considers $k \ge 2$ kinds of resources.

1.1 Problem definition

In the coupled placement problem, we are given a bipartite graph $G = (U, V, E)$ where $U$ is a set of compute nodes and $V$ is a set of storage nodes. We have capacity functions $C : U \to \mathbb{R}$ and $S : V \to \mathbb{R}$ for the compute and storage nodes, respectively. We are also given a set $T$ of jobs, each of which needs to be allocated to one compute node and one storage node. Each job may prefer some compute-storage node pairs more than others, and may also consume different resources at different nodes. To capture these heterogeneities, we have for each job $j$ a function $f_j : E \to \mathbb{R}$, a processing requirement $p_j : E \to \mathbb{R}$ and a storage requirement $s_j : E \to \mathbb{R}$. We note that without loss of generality, we can assume that the capacities are unit, since we can scale the processing and storage requirements of individual nodes accordingly.

We consider two versions of the coupled placement problem. For the maximization version MaxCP, we view $f_j$ as a payment function. Our goal is to select a subset $A \subseteq T$ of jobs and an assignment $\sigma : A \to E$ such that all capacities are observed and our total profit $\sum_{j \in A} f_j(\sigma(j))$ is maximized. For the minimization version MinCP, we view $f_j$ as a cost function. Our goal is to find an assignment $\sigma : T \to E$ such that all capacities are observed and our total cost $\sum_{j \in T} f_j(\sigma(j))$ is minimized.
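To make the definition concrete, here is a minimal Python sketch of a coupled placement instance, with a capacity check and the objective value of a given assignment. The class and function names and the dictionary-based representation are illustrative assumptions, not notation from the paper. Each job stores $f_j$, $p_j$ and $s_j$ as dictionaries keyed by edges $(u, v)$.

```python
from dataclasses import dataclass


@dataclass
class Job:
    """One job; every dict is keyed by an edge (u, v) of the bipartite graph."""
    f: dict  # payment (MaxCP) or cost (MinCP) on each edge
    p: dict  # processing requirement on each edge
    s: dict  # storage requirement on each edge


def is_feasible(jobs, sigma, C, S):
    """Check that sigma (job index -> edge) observes all node capacities.

    C maps compute nodes to capacities, S maps storage nodes to capacities.
    """
    used_c = {u: 0.0 for u in C}
    used_s = {v: 0.0 for v in S}
    for j, (u, v) in sigma.items():
        if (u, v) not in jobs[j].f:      # edge not available to this job
            return False
        used_c[u] += jobs[j].p[(u, v)]
        used_s[v] += jobs[j].s[(u, v)]
    return (all(used_c[u] <= C[u] for u in C)
            and all(used_s[v] <= S[v] for v in S))


def objective(jobs, sigma):
    """Total payment or cost: the sum of f_j(sigma(j)) over the assigned jobs."""
    return sum(jobs[j].f[e] for j, e in sigma.items())


if __name__ == "__main__":
    C = {"u1": 1.0}
    S = {"v1": 1.0, "v2": 1.0}
    e1, e2 = ("u1", "v1"), ("u1", "v2")
    jobs = [
        Job(f={e1: 5, e2: 3}, p={e1: 0.6, e2: 0.6}, s={e1: 0.5, e2: 0.5}),
        Job(f={e1: 4}, p={e1: 0.6}, s={e1: 0.5}),
    ]
    print(is_feasible(jobs, {0: e1}, C, S), objective(jobs, {0: e1}))
    print(is_feasible(jobs, {0: e2, 1: e1}, C, S))  # u1 would carry 1.2 > 1.0
```

In this representation a MaxCP solution is any feasible `sigma` defined on a subset of the jobs, while a MinCP solution must assign every job.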
A generalization of the coupled placement problem is k-sided placement (kSP), in which we have $k$ different sets of nodes, $S_1, \ldots, S_k$, each set of nodes providing a distinct service. For each $i$, we have a capacity function $C_i : S_i \to \mathbb{R}$ that gives the capacity of a node in $S_i$ to provide the $i$th service. We are given a set $T$ of jobs, each of which needs each kind of service; the exact resource needs may depend on the particular k-tuple of nodes from $\prod_i S_i$ to which it is assigned. That is, for each job $j$, we have a demand function $d_j : \prod_i S_i \to \mathbb{R}^k$. We also have another function $f_j : \prod_i S_i \to \mathbb{R}$. As for coupled placement, we can assume that the capacities are unit, since we can scale the demands of individual nodes accordingly. Similar to coupled placement, we consider two versions of kSP, MinkSP and MaxkSP.

1.2 Our Results

All of the variants of CP and kSP are NP-hard, so our focus is on approximation algorithms. Our first set of results consists of the first non-trivial approximation algorithms for MinCP and MinkSP. Under hard capacity constraints, it is easy to see that it is NP-hard to achieve any bounded approximation ratio to cost minimization. So we consider approximation algorithms that incur a blowup in capacity. We say that an algorithm is $\alpha$-approximate for the minimization version if its cost is at most that of an optimal solution, while incurring a blowup factor of at most $\alpha$ in the capacity of any node.

– We present a $(k+1)$-approximation algorithm for MinkSP using iterative rounding, yielding a 3-approximation for MinCP.

We next consider the maximization version. MaxkSP can be expressed as a k-column sparse integer packing program (k-CSP). From this, it is immediate that MaxkSP can be approximated to within an $O(k)$ approximation factor by applying randomized rounding to a linear programming relaxation [7]. An $\Omega(k/\log k)$-inapproximability result for k-set packing due to [16] implies the same hardness result for MaxkSP.
Our second main result is a simpler approximation algorithm for MaxCP and MaxkSP based on local search.

– We present a local search based 15-approximation algorithm for MaxCP. We extend it to MaxkSP and obtain an $O(k^3)$-approximation.

The local search result applies directly to a version where we can assign tasks fractionally but only to a single pair of machines (this is like assigning a task with lower priority and may have additional applications). We then describe a simple rounding scheme to obtain an integral version. The rounding technique involves establishing a one-to-one correspondence between fractional assignments and machines. This is much like the cycle-removing rounding for GAP; there is a crucial difference, however, since coupled and k-sided placements assign jobs to tuples of machines.

Finally, we study the online version of MaxCP, in which tasks arrive online and must be irrevocably assigned or rejected immediately upon arrival.

– We extend the techniques of [8] to the case where the capacity requirement for a job is arbitrarily machine-dependent. This enables us to achieve competitive ratio logarithmic in the ratio of best to worst value-per-capacity density, under necessary technical assumptions about the maximum job size.

1.3 Related Work

The coupled and k-sided placement problems are natural generalizations of the Generalized Assignment Problem (GAP), which can be viewed as a 1-sided placement problem. In GAP, which was first introduced by Shmoys and Tardos [9], the goal is to assign items of various sizes to bins of various capacities. A subset of items is feasible for a bin if their total size is no more than the bin's capacity. If we are required to assign all items and minimize our cost (MinGAP), Shmoys and Tardos [9] give an algorithm for computing an assignment that achieves optimal cost while doubling the capacities of each bin. A previous result by Lenstra et al.
[10] for scheduling on unrelated machines shows that it is NP-hard to achieve optimal cost without incurring a capacity blowup of at least 3/2. On the other hand, if we wish to maximize our profit and are allowed to leave items unassigned (MaxGAP), Chekuri and Khanna [11] observe that the (1, 2)-approximation for MinGAP implies a 2-approximation for MaxGAP. This can be improved to an $\frac{e}{e-1}$-approximation using LP-based techniques [12]. It is known that MaxGAP is APX-hard [11], though no specific constant of hardness is shown.

On the experimental side, most prior work in data center resource management focusses on placing one type of resource at a time: for example, placing storage requirements assuming job compute location is fixed (file allocation problem [2], [13, 14, 3]) or placing compute requirements assuming job storage location is fixed [4, 5]. These in a sense are variants of GAP. The only prior work on Coupled Placement is [6], where they show that MinCP is NP-hard and experimentally evaluate heuristics: in particular, a fast approach based on stable marriage and knapsacks is shown to do well in practice, close to the LP optimal.

The MaxkSP problem is related to the recently studied hypermatching assignment problem (HAP) [15], and special cases, including k-set packing, and a uniform version of the problem. A $(k + 1 + \varepsilon)$-approximation is given for HAP in [15], where other variants of HAP are also studied. While the MaxkSP problem can be viewed as a variant of HAP, there are critical differences. For instance, in MaxkSP, each task is assigned at most one tuple, while in the hypermatching problem each client (or task) is assigned a subset of the hyperedges. Hence, the MaxkSP and HAP problems are not directly comparable. The k-set packing problem can be captured as a special case of MaxkSP, and hence the $\Omega(k/\log k)$-hardness due to [16] applies to MaxkSP as well.

2 The minimization version

Next, we consider the minimization version of the Coupled Placement problem, MinCP.
We write the following integer linear program for MinCP, where $x_{tuv}$ is the indicator variable for the assignment of $t$ to pair $(u, v)$, $u \in U$, $v \in V$.

$$\begin{aligned}
\text{Minimize:}\quad & \sum_{t,u,v} x_{tuv}\,f_t(u,v)\\
\text{Subject to:}\quad & \sum_{u,v} x_{tuv} \ge 1, && \forall t \in T,\\
& \sum_{t,v} p_t(u,v)\,x_{tuv} \le c_u, && \forall u \in U,\\
& \sum_{t,u} s_t(u,v)\,x_{tuv} \le d_v, && \forall v \in V,\\
& x_{tuv} \in \{0,1\}, && \forall t \in T,\ u \in U,\ v \in V.
\end{aligned}$$

We refer to the first set of constraints as satisfaction constraints, and to the second and third sets as capacity constraints (processing and storage). We consider the linear relaxation of this program, which replaces the integrality constraints above with $0 \le x_{tuv} \le 1$, $\forall t \in T,\ u \in U,\ v \in V$.

2.1 A 3-approximation algorithm for MinCP

We now present algorithm IterRound, based on iterative rounding [21], which achieves a 3-approximation for MinCP. We start with a basic algorithm that achieves a 5-approximation by identifying tight constraints with a small number of variables. The algorithm repeats the following round until all variables have been rounded.

1. Extreme point: Compute an extreme point solution $x$ to the current LP.
2. Eliminate variable or constraint: Execute one of these two steps. By Lemma 3, one of these steps can always be executed if the LP is nonempty.
   (a) Remove from the LP all variables $x_{tuv}$ that take the value 0 or 1 in $x$. If $x_{tuv}$ is 1, then assign job $t$ to the pair $(u, v)$, remove the job $t$ and its associated variables from the LP, and reduce $c_u$ by $p_t(u, v)$ and $d_v$ by $s_t(u, v)$.
   (b) Remove from the LP any tight capacity constraint with at most 4 variables.

Fix an iteration of the algorithm, and an extreme point $x$. Let $n_t$, $n_c$, and $n_s$ denote the number of tight task satisfaction constraints, computation constraints, and storage constraints, respectively, in $x$. Note that every task satisfaction constraint can be assumed to be tight, without loss of generality. Let $N$ denote the number of variables in the LP. Since $x$ is an extreme point, if all variables in $x$ take values in $(0, 1)$, then we have $N = n_t + n_c + n_s$.

Lemma 1.
If all variables in $x$ take values in $(0, 1)$, then $n_t \le N/2$.

Proof. Since a variable only occurs once over all satisfaction constraints, if $n_t > N/2$, there exists a satisfaction constraint that has exactly one variable. But then, this variable needs to take value 1, a contradiction.

Lemma 2. If $n_t \le N/2$, then there exists a tight capacity constraint that has at most 4 variables.

Proof. If $n_t \le N/2$, then $n_s + n_c = N - n_t \ge N/2$. Since each variable occurs in at most one computation constraint and at most one storage constraint, the total number of variable occurrences over all tight storage and computation constraints is at most $2N$, which is at most $4(n_s + n_c)$. This implies that at least one of these tight capacity constraints has at most 4 variables.

Using Lemmas 1 and 2, we can argue that the above algorithm yields a 5-approximation. Step 2a does not cause any increase in cost or capacity. Step 2b removes a constraint, hence cannot increase cost; since the removed constraint has at most 4 variables, the total demand allocated on the relevant node is at most the demand of four tasks plus the capacity already used in earlier iterations. Since each task demand is at most the capacity of the node, we obtain a 5-approximation with respect to capacity.

Studying the proof of Lemma 2 more closely, one can separate the case $n_t < N/2$ from the case $n_t = N/2$; in the former case, one can, in fact, show that there exists a tight capacity constraint with at most 3 variables. Together with a careful consideration of the $n_t = N/2$ case, one can improve the approximation factor to 4.

We now present an alternative selection of tight capacity constraint that leads to a 3-approximation. One interesting aspect of this step is that the constraint being selected may not have a small number of variables. We replace step 2b by the following.

2b. Remove from the LP any tight capacity constraint in which the number of variables is at most two more than the sum of the values of the variables.
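The selection rule in the revised step 2b can be sketched in Python as follows. This is only an illustration of the test itself, under an assumed representation: each capacity constraint is a pair of a coefficient dictionary (variable name to demand) and a remaining capacity, and `x` holds the current fractional LP solution. The function name, the `slack` parameter and the tolerance are hypothetical. Solving the LP and obtaining an extreme point are not shown.

```python
def pick_removable_constraint(constraints, x, slack=2, tol=1e-9):
    """Return the name of a tight capacity constraint whose number of
    variables is at most `slack` more than the sum of their values,
    or None if no such constraint exists.

    constraints: dict name -> (coeffs, capacity), coeffs: dict var -> demand
    x:           dict var -> fractional value in the current LP solution
    """
    for name, (coeffs, capacity) in constraints.items():
        load = sum(demand * x[var] for var, demand in coeffs.items())
        if abs(load - capacity) > tol:
            continue                          # constraint is not tight
        num_vars = len(coeffs)
        value_sum = sum(x[var] for var in coeffs)
        if num_vars - value_sum <= slack + tol:
            return name
    return None


if __name__ == "__main__":
    x = {"a": 0.8, "b": 0.7, "c": 0.5,
         "w": 0.25, "y": 0.25, "z": 0.25, "q": 0.25}
    constraints = {
        # tight, but 4 variables summing to 1: difference 3 > 2
        "v": ({"w": 1.0, "y": 1.0, "z": 1.0, "q": 1.0}, 1.0),
        # tight, 3 variables summing to 2: difference 1 <= 2
        "u": ({"a": 0.5, "b": 0.5, "c": 0.5}, 1.0),
    }
    print(pick_removable_constraint(constraints, x))
```

Setting `slack=k` gives the corresponding test for the k-sided generalization described in Section 2.2.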
Lemma 3. If all variables in $x$ take values in $(0, 1)$, then there exists a tight capacity constraint in which the number of variables is at most two more than the sum of the values of the variables.

Proof. Since each variable occurs in at most two tight capacity constraints, the total number of occurrences of all variables across the tight capacity constraints is $2N - s$ for some nonnegative integer $s$. Since each satisfaction constraint is tight, each variable appears in 2 capacity constraints, and each variable takes on value less than 1, the sum of all the variables over the tight capacity constraints is at least $2n_t - s$. Therefore, the sum, over all tight capacity constraints, of the difference between the number of variables and their sum is at most $2(N - n_t)$. Since there are $N - n_t$ tight capacity constraints, for at least one of these constraints, the difference between the number of variables and their sum is at most 2.

Lemma 4. Let $u$ be a node with a tight capacity constraint, in which the number of variables is at most 2 more than the sum of the variables. Then, the sum of the capacity requirements of the tasks partially assigned to $u$ is at most the current available capacity of $u$ plus twice the capacity of $u$.

Proof. Let $\ell$ be the number of variables in the constraint for $u$, and let the associated tasks be numbered 1 through $\ell$. Let the demand of task $j$ for the capacity of node $u$ be $d_j$. Then, the capacity constraint for $u$ is $\sum_j d_j x_j = \hat c(u)$, where $\hat c(u)$ is the available capacity of $u$ in the current LP. We know that $\ell - \sum_j x_j \le 2$. Since $d_j \le C(u)$, the capacity of $u$, we have

$$\sum_{j} d_j = \hat c(u) + \sum_{j=1}^{\ell} (1 - x_j)\,d_j \le \hat c(u) + \Bigl(\ell - \sum_{j=1}^{\ell} x_j\Bigr) C(u) \le \hat c(u) + 2\,C(u).$$

Theorem 1. IterRound is a polynomial-time 3-approximation algorithm for MinCP.

Proof. By Lemma 3, each iteration of the algorithm removes either a variable or a constraint from the LP. Hence the algorithm is polynomial time. The elimination of a variable that takes value 0 or 1 does not change the cost.
The elimination of a constraint can only decrease cost, so the final solution has cost no more than the value achieved by the original LP. Finally, when a capacity constraint is eliminated, by Lemma 4, we incur a blowup of at most 3 in capacity.

2.2 A (k + 1)-approximation algorithm for MinkSP

It is straightforward to generalize the algorithm of the preceding section to obtain a $(k+1)$-approximation to MinkSP. We first set up the integer LP for MinkSP. For a given element $e \in \prod_i S_i$, we use $e_i$ to denote the $i$th coordinate of $e$. Let $x_{te}$ be the indicator variable that $t$ is assigned to $e \in \prod_i S_i$.

$$\begin{aligned}
\text{Minimize:}\quad & \sum_{t,e} x_{te}\,f_t(e)\\
\text{Subject to:}\quad & \sum_{e} x_{te} \ge 1, && \forall t \in T,\\
& \sum_{t,\,e:\,e_i = u} (d_t(e))_i\,x_{te} \le C_i(u), && \forall\, 1 \le i \le k,\ u \in S_i,\\
& x_{te} \in \{0,1\}, && \forall t \in T,\ e \in \textstyle\prod_i S_i.
\end{aligned}$$

The algorithm, which we call IterRound(k), is identical to IterRound of Section 2.1 except that step 2b is replaced by the following.

2b. Remove from the LP any tight capacity constraint in which the number of variables is at most $k$ more than the sum of the values of the variables.

The claims and proofs are almost identical to the $k = 2$ case and are moved to Appendix A. A natural question to ask is whether a linear approximation factor of MinkSP is unavoidable for polynomial time algorithms. Unfortunately, we do not have any non-trivial results in this direction. We have been able to show that the MinkSP linear program has an integrality gap that grows as $\Omega(\log k/\log\log k)$ (see Appendix A).

3 The maximization problems

We present approximation algorithms for the maximization versions of coupled placement and k-sided placement problems. We first observe, in Section 3.1, that these problems reduce to column sparse integer packing. We next present, in Section 3.2, an alternative combinatorial approach based on local search.

3.1 An LP-based approximation algorithm

One can write a positive integer linear program for MaxCP. Let $x_{tuv}$ denote the indicator variable for the assignment of job $t$ to the pair $(u, v)$, $u \in U$, $v \in V$.
The goal is then to

$$\begin{aligned}
\text{Maximize:}\quad & \sum_{t,u,v} x_{tuv}\,f_t(u,v)\\
\text{Subject to:}\quad & \sum_{u,v} x_{tuv} \le 1, && \forall t \in T,\\
& \sum_{t,v} p_t(u,v)\,x_{tuv} \le c_u, && \forall u \in U,\\
& \sum_{t,u} s_t(u,v)\,x_{tuv} \le d_v, && \forall v \in V,\\
& x_{tuv} \in \{0,1\}, && \forall t \in T,\ u \in U,\ v \in V.
\end{aligned}$$

Note that we can deal with capacities on $u$, $v$ by scaling the $p_t(u, v)$ and $s_t(u, v)$ values appropriately. The above LP can be easily extended to MaxkSP (see Appendix B). These linear programs are 3- and k-column sparse packing programs, respectively, and can be approximated to within a factor of 15.74 and $ek + o(k)$, respectively, using a clever randomized rounding approach. We next give a combinatorial approach based on local search which is likely to be much more efficient in practice.

3.2 Approximation algorithms based on local search

Before giving the details, we start with a few helpful definitions. For any $u \in U$, $F_u = \sum_{t,v} x_{tuv} f_t(u, v)$. Similarly, for any $v \in V$, $F_v = \sum_{t,u} x_{tuv} f_t(u, v)$. We set $\mu = \frac{1}{n} \max_{t,u,v} f_t(u, v)$. It follows that the optimum solution is at least $n\mu$ and at most $n^2\mu$.

The local search algorithm will maintain the following two invariants: (1) for each $t$, there is at most one pair $(u, v)$ for which $x_{tuv} > 0$; (2) all the linear program inequalities hold. It is easy to set an initial state where the invariants hold (all $x_{tuv} = 0$). The local search algorithm proceeds in the following steps:

While there exist $t, u, v$ such that
$$f_t(u,v) > F_u\,\frac{p_t(u,v)}{c_u} + F_v\,\frac{s_t(u,v)}{d_v} + \sum_{u',v'} x_{tu'v'}\,f_t(u',v') + \epsilon\mu:$$

1. Set $x_{tuv} = 1$ and set $x_{tu'v'} = 0$ for all $(u', v') \ne (u, v)$.
2. While $\sum_{t,v} p_t(u, v)\,x_{tuv} > c_u$, reduce $x_{tuv}$ for the job with minimum $c_u f_t(u, v)/p_t(u, v)$ such that $x_{tuv} > 0$.
3. While $\sum_{t,u} s_t(u, v)\,x_{tuv} > d_v$, reduce $x_{tuv}$ for the job with minimum $d_v f_t(u, v)/s_t(u, v)$ such that $x_{tuv} > 0$.

Theorem 2. The local search algorithm maintains the two stated invariants.

Proof. The first invariant is straightforward, because the only time we increase an $x_{tuv}$ value we simultaneously set all other values for the same $t$ to zero.
The only time the linear program inequalities can be violated is immediately after setting xtuv = 1. However, the two steps immediately after this operation will reduce the values of other jobs so as to satisfy the inequalities (and this is done without increasing any xtuv so no new constraint can be violated). Theorem 3. The local search algorithm produces a 3+ approximate fractional solution satisfying the invariants. Proof. When the algorithm terminates, we have for all t, u, v: ft(u, v) ≤ Fu pt(u,v) cu + Fv st(u,v) dv + Σu0 ,v0xtu0v 0ft(u 0 , v0 )µ. We sum this over t, u, v representing the optimum integer assignments: OP T ≤ ΣuFu + ΣvFv + Σt,u,vxtuvft(u, v) + OP T. Each summation simplifies to the algorithm’s objective value, giving the result. Theorem 4. The local search algorithm runs in polynomial time. Proof. Setting xtuv = 1 and setting all other xtu0v 0 = 0 adds ft(u, v)−Σu0v 0xtu0v 0ft(u 0 , v0 ) to the algorithm’s objective. The next two steps of the algorithm (making sure the LP inequalities hold) reduce the objective by at most Fu pt(u,v) cu + Fv st(u,v) dv . It follows that each iteration of the main loop increases the solution value by at least µ. By definition of µ, this can happen at most n 2/ times. Each selection of (t, u, v) can be done in polynomial time (at worst, by simply trying all tuples). Rounding Phase: When the local search algorithm terminates, we have a fractional solution with the additional guarantee from the first invariant. Note that we can extend this to the k-sided version if we increase the approximation factor to k+1+. Below, we give two different rounding schemes. The first works for general values of k and loses an O(k 2 ) factor, for an overall approximation factor of O(k 3 ). The second is specific to the k = 2 case and obtains a better approximation. 1. We randomly make each assignment with probability p times the fractional value (so pxtuv for Coupled Placement), for some p to be defined later. 2. 
For each assigned job $t$, if the other jobs $t' \ne t$ assigned to any one of its assigned machines violate the corresponding linear program constraint, we immediately drop job $t$. For Coupled Placement this means that if $\sum_{t' \ne t,\, v} p_{t'}(u,v)\, x_{t'uv} > 1$ for any $t, u$, we set $x_{tuv} = 0$.
3. Note that we may still violate linear program constraints, but for any particular machine the constraint would be satisfied if we dropped any one of its assigned jobs. We divide the assigned jobs into $k+1$ groups. These groups should guarantee that for any machine with at least two assigned jobs, not all its jobs are members of the same group. We then select the group with the largest total objective value as our final solution.

Theorem 5. For the $k$-sided version, the rounding scheme runs in polynomial time and achieves an $O(k^2)$-approximation over the fractional approximation factor (so an overall factor of $O(k^3)$ using local search) for an appropriate choice of $p$.

Proof. The first two steps finish with a solution of value at least $p(1-p)^k$ times the optimum in expectation. This is because for any job $t$, the probability of placing this job in step one is exactly $p$ times its fractional value. Consider any machine $m$ where the job is assigned; the expected total size of the other jobs $t' \ne t$ assigned to this machine is at most $p\,c_m$, and thus the probability that these other jobs exceed $c_m$ is at most $p$. The probability that none of the $k$ machines where $t$ is assigned exceeds capacity from other jobs will be at least $(1-p)^k$. We may still violate constraints. Dividing into $k+1$ groups and picking the best gives a result which is at least $\frac{1}{k+1}\, p(1-p)^k$ times the optimum without violating constraints. Selecting $p = \frac{1}{k}$ gives the desired approximation factor.

It remains to show that the division into groups can be performed in polynomial time. We start with all machines unmarked. For each group, we select a maximal set of jobs no two of which are assigned the same unmarked machine.
We then mark all machines to which one of our current group of jobs is assigned. Note that immediately before we select group $i$, each remaining job is assigned to at most $k-i+1$ unmarked machines. For $i = 1$ this is obvious. Inductively, suppose that job $j$ is assigned to more than $k-i$ unmarked machines immediately before selecting group $i+1$. Before selecting group $i$, job $j$ was assigned to at most $k-i+1$ unmarked machines, and since we never "unmark" a machine it follows that job $j$ was assigned to exactly $k-i+1$ unmarked machines both before and after the selection of group $i$. But then none of the jobs selected in group $i$ are assigned to any of the unmarked machines assigned to job $j$ (else they would have become marked after selection of group $i$). So we can augment group $i$ with job $j$ without violating the constraint that no two jobs of group $i$ are on the same unmarked machine. This contradicts the maximality of group $i$. We thus conclude that immediately before we select group $k+1$, each remaining job is assigned only to marked machines. Thus group $k+1$ selects all remaining jobs (maximality) and the jobs are divided into $k+1$ groups.

Consider any machine $m$ with at least two assigned jobs. Let group $i$ be the first group to contain a job from $m$. Thus prior to selection of group $i$, we had not selected any job which was assigned to $m$, and $m$ was unmarked. So group $i$ cannot include more than one job from machine $m$ without violating the condition that no two jobs share an unmarked machine. It follows that there are at least two distinct groups which contain jobs from machine $m$ (group $i$ and also some later group).

For MaxCP, we can improve the approximation factor. We refer the reader to Appendix B for details.

Theorem 6. For MaxCP, there exists a polynomial-time algorithm based on local search that achieves a $(15+\varepsilon)$-approximation.

4 Online MaxCP and MaxkSP

We now study the online version of MaxCP, in which jobs arrive in an online fashion.
When a job arrives we must irrevocably assign it or reject it. Our goal is to maximize our total value at the end of the instance. We apply the techniques of [8] to obtain an online algorithm with a logarithmic competitive ratio under certain assumptions. We first note that online MaxCP differs from the model considered in [8] in that a job's computation and storage requirements need not be the same. As demonstrated in [8], certain assumptions have to be made to achieve competitive ratios of any interest. We extend these assumptions for the MaxCP model as follows:

Assumption 1. There exists $F$ such that for all $t, u, v$ either $f_t(u,v) = 0$ or $1 \le f_t(u,v) \le F \min\left(\frac{p_t(u,v)}{c_u}, \frac{s_t(u,v)}{d_v}\right)$.

Assumption 2. For $\varepsilon = \min\left(\frac{1}{2}, \frac{1}{\ln(2F+1)}\right)$, for all $t, u, v$: $p_t(u,v) \le \varepsilon c_u$ and $s_t(u,v) \le \varepsilon d_v$.

It is not hard to show that these assumptions (or some similar flavor of them) are in fact necessary to obtain any interesting competitive ratio (proof in Appendix C).

Theorem 7. No deterministic online algorithm can be competitive over classes of instances where either one of the following is true: (i) job size is allowed to be arbitrarily large relative to capacities, or (ii) job values and resource requirements are completely uncorrelated.

A small modification to the algorithm of [8] gives an $O(\log F)$-competitive algorithm. Moreover, the lower bound of $\Omega(\log F)$ shown in [8] applies to online MaxCP as well. (See Appendix D for proof.)

Theorem 8. There exists a deterministic $O(\log F)$-competitive algorithm for online MaxCP under Assumptions 1 and 2. For MaxkSP, this can be extended to an $O(\log kF)$-competitive algorithm. Moreover, any online deterministic algorithm for online MaxCP has competitive ratio $\Omega(\log F)$, and for online MaxkSP has competitive ratio $\Omega(\log kF)$.

Theorem 9. There exists a randomized $O(\log F)$-competitive algorithm (in expectation) for online MaxCP under Assumption 1 even if we weaken Assumption 2 to require only that $\varepsilon = \frac{1}{2}$.
No deterministic online algorithm for the problem can accomplish such a result.

Acknowledgments

We would like to thank Aravind Srinivasan for helpful discussions, and for pointing us to the $\Omega(k/\log k)$-hardness result for $k$-set packing, in particular. We thank anonymous referees for helpful comments on an earlier version of the paper, and are especially grateful to a referee who generously offered the key insights leading to improved results for MinCP and MinkSP.

References

1. Patterson, D.A.: Technical perspective: the data center is the computer. Communications of the ACM 51 (January 2008) 105–105
2. Dowdy, L.W., Foster, D.V.: Comparative models of the file assignment problem. ACM Surveys 14 (1982)
3. Anderson, E., Kallahalla, M., Spence, S., Swaminathan, R., Wang, Q.: Quickly finding near-optimal storage designs. ACM Transactions on Computer Systems 23 (2005) 337–374
4. Appleby, K., Fakhouri, S., Fong, L., Goldszmidt, G., Kalantar, M., Krishnakumar, S., Pazel, D., Pershing, J., Rochwerger, B.: Oceano-SLA based management of a computing utility. In: Proceedings of the International Symposium on Integrated Network Management. (2001) 855–868
5. Chase, J.S., Anderson, D.C., Thakar, P.N., Vahdat, A.M., Doyle, R.P.: Managing energy and server resources in hosting centers. In: Proceedings of the Symposium on Operating Systems Principles. (2001) 103–116
6. Korupolu, M., Singh, A., Bamba, B.: Coupled placement in modern data centers. In: Proceedings of the International Parallel and Distributed Processing Symposium. (2009) 1–12
7. Bansal, N., Korula, N., Nagarajan, V., Srinivasan, A.: On k-column sparse packing programs. In: Proceedings of the Conference on Integer Programming and Combinatorial Optimization. (2010) 369–382
8. Awerbuch, B., Azar, Y., Plotkin, S.: Throughput-competitive on-line routing. In: Proceedings of the Symposium on Foundations of Computer Science. (1993) 32–40
9. Shmoys, D.B., Tardos, É.: An approximation algorithm for the generalized assignment problem. Mathematical Programming 62(3) (1993) 461–474
10. Lenstra, J.K., Shmoys, D.B., Tardos, É.: Approximation algorithms for scheduling unrelated parallel machines. Mathematical Programming 46(3) (1990) 259–271
11. Chekuri, C., Khanna, S.: A PTAS for the multiple knapsack problem. In: Proceedings of the Symposium on Discrete Algorithms. (2000) 213–222
12. Fleischer, L., Goemans, M.X., Mirrokni, V.S., Sviridenko, M.: Tight approximation algorithms for maximum general assignment problems. In: SODA. (2006) 611–620
13. Alvarez, G.A., Borowsky, E., Go, S., Romer, T.H., Becker-Szendy, R., Golding, R., Merchant, A., Spasojevic, M., Veitch, A., Wilkes, J.: Minerva: An automated resource provisioning tool for large-scale storage systems. Transactions on Computer Systems 19 (November 2001) 483–518
14. Anderson, E., Hobbs, M., Keeton, K., Spence, S., Uysal, M., Veitch, A.: Hippodrome: Running circles around storage administration. In: Proceedings of the Conference on File and Storage Technologies. (2002) 175–188
15. Cygan, M., Grandoni, F., Mastrolilli, M.: How to sell hyperedges: The hypermatching assignment problem. In: SODA. (2013) 342–351
16. Hazan, E., Safra, S., Schwartz, O.: On the complexity of approximating k-set packing. Computational Complexity 15(1) (2006) 20–39
17. Vazirani, V.V.: Approximation Algorithms. Springer-Verlag (2001)
18. Frieze, A.M., Clarke, M.: Approximation algorithms for the m-dimensional 0-1 knapsack problem: Worst-case and probabilistic analyses. European Journal of Operational Research 15(1) (1984) 100–109
19. Chekuri, C., Khanna, S.: On multi-dimensional packing problems. In: Proceedings of the Symposium on Discrete Algorithms. (1999) 185–194
20. Srinivasan, A.: Improved approximations of packing and covering problems. In: Proceedings of the Symposium on Theory of Computing. (1995) 268–276
21.
Lau, L., Ravi, R., Singh, M.: Iterative Methods in Combinatorial Optimization. Cambridge Texts in Applied Mathematics. Cambridge University Press (2011)

A Proofs for MinkSP

Fix an iteration of the algorithm, and an extreme point $x$. Let $n_t$ denote the number of tight satisfaction constraints, and $n_i$ denote the number of tight capacity constraints on the $i$th side. Since $x$ is an extreme point, if all variables in $x$ take values in $(0,1)$, then we have $N = n_t + \sum_i n_i$.

Lemma 5. If all variables in $x$ take values in $(0,1)$, then there exists a tight capacity constraint in which the number of variables is at most $k$ more than the sum of the variables.

Proof. Since each variable occurs in at most $k$ tight capacity constraints, the total number of occurrences of all variables across the tight capacity constraints is $kN - s$ for some nonnegative integer $s$. Since each satisfaction constraint is tight, each variable appears in $k$ capacity constraints, and each variable takes on value at most 1, the sum of all the variables over the tight capacity constraints is at least $k n_t - s$. Therefore, the sum, over all tight capacity constraints, of the difference between the number of variables and their sum is at most $k(N - n_t)$. Since the number of tight capacity constraints is $N - n_t$, for at least one of these constraints, the difference between the number of variables and their sum is at most $k$.

Lemma 6. Let $u$ be a side-$i$ node with a tight capacity constraint, in which the number of variables is at most $k$ more than the sum of the variables. Then, the sum of the capacity requirements of the tasks partially assigned to $u$ is at most the available capacity of $u$ plus $k C_i(u)$.

Proof. Let $m$ be the number of variables in the constraint for $u$, and let the associated tasks be numbered 1 through $m$. Let the demand of task $j$ for the capacity of node $u$ be $d_j$. Then, the capacity constraint for $u$ is $\sum_j d_j x_j = \hat{c}(u)$. We know that $m - \sum_j x_j \le k$. We also have $d_j \le C_i(u)$ for every task $j$.
Letting $\hat{c}(u)$ denote the current capacity of $u$, we now derive
\[ \sum_{j=1}^{m} d_j = \hat{c}(u) + \sum_{j=1}^{m} (1 - x_j)\, d_j \le \hat{c}(u) + \Big(m - \sum_{j=1}^{m} x_j\Big) C_i(u) \le \hat{c}(u) + k\, C_i(u). \]

Theorem 10. IterRound($k$) is a polynomial-time $(k+1)$-approximation algorithm for MinkSP.

Proof. By Lemma 5, each iteration of the algorithm removes either a variable or a constraint from the LP. Hence the algorithm runs in polynomial time. The elimination of a variable that takes value 0 or 1 neither changes cost nor incurs capacity blowup. The elimination of a constraint can only decrease cost, so the final solution has cost no more than the value achieved by the original LP. Finally, by Lemma 6, we incur a blowup of at most $1 + k$ in capacity.

We now show that the MinkSP LP has an integrality gap of $\Omega(\log k / \log\log k)$. We recursively construct an integrality gap instance with $\ell^t$ sides, for parameters $\ell$ and $t$, with two nodes per side, one with infinite capacity and the other with unit capacity, such that any integral solution has at least $t$ tasks on the unit-capacity node on some side, while there is a fractional solution with load of at most $t/\ell$ on the unit-capacity node of each side. Setting $t = \ell$ and $k = \ell^\ell$, we obtain an instance in which the capacity used by the fractional solution is 1, while any integral solution has load $\ell = \Theta(\log k / \log\log k)$.

Each task can be placed on one tuple from a subset of tuples; for a given tuple, the demand of the task on each side of the tuple is one. We start with the construction for $t = 1$. We introduce a task that has $\ell$ choices, the $i$th choice consisting of the unit-capacity node from side $i$ and infinite-capacity nodes on all other sides. Clearly, any integral solution uses up the unit capacity of one unit-capacity node, while there is a fractional solution ($1/\ell$ for each choice) that uses only a $1/\ell$ fraction of each unit-capacity node.

Given a construction for $\ell^t$ sides, we show how to extend to $\ell^{t+1}$ sides.
We take $\ell$ identical copies of the instance with $\ell^t$ sides and combine the tuples for each task in such a way that for any $i$, any integral placement places exactly the same task on side $i$ of each copy. Now we add task $t+1$, which can be placed in one of $\ell$ tuples: the unit-capacity node on all sides of copy $i$ and the infinite-capacity node on all other sides, for each $i$. Clearly, any integral solution will have to add one more task to a unit-capacity node of a side that already has load $t$, yielding a load of $t+1$, while a fractional solution can assign load of at most $1/\ell$ to the unit-capacity nodes of each side.

B Proofs for MaxkSP and MaxCP

We first present the linear program for MaxkSP (recall the definition in Section 1.1). Let $x_{te}$ denote the indicator variable for the assignment of job $t$ to the $k$-tuple $e$.

Maximize:
\[ \sum_{t,e} x_{te}\, f_t(e) \]
Subject to:
\[ \sum_{e} x_{te} \le 1, \quad \forall t \in T, \]
\[ \sum_{t,e} (d_t(e))_i\, x_{te} \le C_i(e_i), \quad \forall i \in \{1, \ldots, k\}, \]
\[ x_{te} \in \{0,1\}, \quad \forall t \in T,\ e \in \prod_i S_i. \]

We now present the improved approximation algorithm for MaxCP. The idea is to obtain a one-to-one correspondence between fractional assignments and machines. Essentially we view the machines as nodes of a graph where the edges are the fractional assignments (this is similar to the rounding for generalized assignment). If we have a cycle, the idea is to shift the fractions around the cycle (i.e., increase one $x_{tuv}$, then decrease some $x_{t'vw}$, increase some $x_{t''wx}$, and so forth). Applying this directly on a single cycle may violate some constraints; while we try to increase and decrease the fractions in such a way that constraints hold, since each job has a different "size" on its two endpoints we may wind up violating the constraint on $\sum_{t,v} x_{tuv}\, p_t(u,v)$ at a single node $u$. This prevents us from doing a simple cycle elimination as in generalized assignment. However, if we have two adjoining (or connected) cycles the process can be made to work. The remaining case is a single cycle, where we can assign each edge to one of its endpoints.
Generalized assignment rounding would now proceed to integrally assign each job to its corresponding machine; we cannot do this because each job requires two machines, and each machine thus has multiple fractional assignments (all but one of which "correspond" to some other machine).

Lemma 7. Given any fractional solution which satisfies the local search invariants, we can produce an alternative fractional solution (also satisfying the local search invariants and with equal or greater value). This new fractional solution labels each job $t$ with $0 < x_{tuv} < 1$ with either $u$ or $v$, guaranteeing that each $u$ is labeled with at most one job.

Proof. Consider a graph where the nodes are machines, and we have an edge $(u,v)$ for any fractional assignment $0 < x_{tuv} < 1$. If any node has degree zero or one, we remove that node and its assigned edge (if any), labeling the removed edge with that node. We continue this process until all remaining nodes have degree at least two. If there is a node of degree three, then there must exist two (distinct but not necessarily edge-disjoint) cycles with a path between them (possibly a path of length zero); since the graph is bipartite, all cycles are even in length. We can alternately increase and decrease the fractional assignments of edges along a cycle such that the total load $\sum_{t,v} p_t(u,v)\, x_{tuv}$ changes only on a single node $u$ where the path between cycles intersects this cycle. We can do the same along the other cycle. We can then do the same thing along the path, and equalize the changes (multiplicatively) such that there is no overall change in load, but at least one edge has its fractional value changing. If this process decreases the value, we can reverse it to increase the value. This allows us to modify the fractional solution in a way that increases the number of integral assignments without decreasing the value.
After applying this repeatedly (and repeating the node/edge removal process above where necessary), we are left with a graph that consists only of node-disjoint cycles. Each of the remaining edges will be labeled with one of its two endpoints (one to each). The overall effect is that we have a one-to-one labeling correspondence between fractional assignments and machines (each fractional edge to one of its two assigned machines). Note however that since each job is assigned to two machines and labeled with only one of the two, this does not imply that each machine has only one fractional assignment.

Once this is done, we consider three possible solutions. One consists of all the integral assignments. The second considers only those assignments which are fractional and labeled with nodes $u$. For each node $v$, we select a subset of its fractional assignments to make integrally, so as to maximize the value without violating the capacity of $v$. We cannot violate the capacity of $u$ because we select at most one job for each such machine. The result has at least $\frac{1}{2}$ the value of assignments labeled with nodes $u$. For the third solution, we do the same but with the roles of $u$, $v$ reversed. We select the best of these three solutions; our choice obtains at least $\frac{1}{5}$ of the overall value.

Proof of Theorem 6: The algorithm sketch contains most of the proof. We need to establish that we can get at least $\frac{1}{2}$ the fractional value on a single machine integrally. This can be done by selecting jobs in decreasing order of density ($f_t(u,v)/p_t(u,v)$) until we overflow the capacity. Including the job that overflows capacity, this must be better than the fractional solution. Thus we can select either everything but the job that overflows capacity, or that job by itself. We also need to establish the $\frac{1}{5}$ value claim. If we were to select the integral assignments with probability $\frac{1}{5}$ and each of the other two solutions with probability $\frac{2}{5}$, we would get an expected $\frac{1}{5}$ of the fractional solution.
Deterministically selecting the best of the three solutions can only be better than this. □

C Proof of Theorem 7

We first show that if resource requirements are large compared to capacities, payment functions $f_t$ are exactly equal to the total amount of resources, and each job requires the same amount over all resources/dimensions (but different jobs can require different amounts), then no deterministic online algorithm can be competitive. Consider a graph $G$ with a single compute node and a single data storage node. Each node has one-dimensional compute/storage capacity of $L$. A job arrives requesting 1 unit of computing and storage and will pay 2. Clearly, any competitive deterministic algorithm must accept this job, in case this is the only job. However, a second job arrives requesting $L$ units of computing and storage and will pay $2L$. In this case, the algorithm is $L$-competitive, and $L$ can be arbitrarily large.

Next, we show that if resource requirements are small relative to capacities, payment functions $f_t$ are arbitrary, and resource requirements are identical, then no deterministic online algorithm can be competitive. This instance satisfies Assumption 2 but not Assumption 1. Consider again a graph $G$ with a single compute node and a single data storage node, each with one-dimensional, unit capacities. We will use up to $k+1$ jobs, each requiring $1/k$ units of computing and storage. The $i$-th job, $0 \le i \le k$, will pay $M^i$ for some large value $M$. Now, consider any deterministic algorithm. If it fails to accept any job $j < k$, then if job $j$ is the last job, it will be $\Omega(M)$-competitive. If the algorithm accepts jobs 0 up through $k-1$ then it will not be able to accept job $k$ and will be $\Omega(M)$-competitive. In all cases it has competitive ratio at least $\Omega(M)$, and $M$ and $k$ can be arbitrarily large.
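The geometric-payment argument above can be checked numerically. The following is a minimal sketch, not part of the original proof: it assumes the adversary stops the sequence at the first job the algorithm declines, so a deterministic algorithm is characterized by the index `j` of its first rejection. The function name `worst_case_ratio` and the use of exact fractions are illustrative choices.

```python
from fractions import Fraction


def worst_case_ratio(M, k):
    """Smallest OPT/ALG ratio over all first-rejection points j = 1..k.

    Jobs 0..k each need 1/k of the unit capacity and job i pays M**i.
    (Rejecting job 0 is skipped: the algorithm would earn nothing,
    so the ratio is unbounded.)
    """
    best = None
    for j in range(1, k + 1):
        # The algorithm accepted jobs 0..j-1 and does not accept job j
        # (for j == k it cannot, because capacity is already full).
        alg = sum(M**i for i in range(j))
        # The adversary stops after job j; the optimum keeps the k most
        # valuable jobs among those that arrived.
        values = sorted((M**i for i in range(j + 1)), reverse=True)
        opt = sum(values[:k])
        ratio = Fraction(opt, alg)
        if best is None or ratio < best:
            best = ratio
    return best
```

Under these assumptions the ratio is never below $M$ for any choice of `j`, which matches the $\Omega(M)$ bound stated in the proof.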
Similarly, if resource requirements are small relative to capacities, payment functions $f_t$ are exactly equal to the total amount of resources requested, and resource requirements are arbitrary, then no deterministic online algorithm can be competitive. Consider once more a graph $G$ with a single compute node and a single data storage node with one-dimensional compute/storage capacities. However, this time the compute capacity will be 1 and the storage capacity will be some very large $L$. We will use up to $k+1$ jobs, each requiring $1/k$ units of computing. The $i$-th job, $0 \le i \le k$, will require the appropriate amount of storage so that its value is $M^i$ for very large $M$. Assuming $L = O(kM^k)$, all these storage requirements are at most $1/k$ of $L$. Note that storage can accommodate all jobs, but computing can accommodate at most $k$ jobs. Any deterministic algorithm will have competitive ratio $\Omega(M)$, and $k$, $M$ and $L$ can be suitably large.

Thus, it follows that some flavor of Assumptions 1 and 2 is necessary to achieve any interesting competitive result.

D Proof of Theorem 8

We adapt the framework of [8] to solve the online MaxCP problem. This framework uses an exponential cost function to place a price on the remaining capacity of a node. If the value obtained from a task can cover the cost of the capacity it consumes, we admit the task. In the algorithm below, $e$ is the base of the natural logarithm. We first show that our algorithm will not exceed capacities. Essentially, this occurs because the cost will always be sufficiently high.

Lemma 8. Capacity constraints are not violated at any time during this algorithm.

Algorithm 1 Online algorithm for MaxCP.
  $\lambda_u(1) \leftarrow 0$, $\lambda_v(1) \leftarrow 0$ for all $u \in U$, $v \in V$
  for each new task $j$ do
    $\mathrm{cost}_u(j) \leftarrow \frac{1}{2}\left(e^{\lambda_u(j)\ln(2F+1)/(1-\varepsilon)} - 1\right)$
    $\mathrm{cost}_v(j) \leftarrow \frac{1}{2}\left(e^{\lambda_v(j)\ln(2F+1)/(1-\varepsilon)} - 1\right)$
    For all $uv$ let $Z_{juv} = \frac{p_j(u,v)}{c_u}\mathrm{cost}_u(j) + \frac{s_j(u,v)}{d_v}\mathrm{cost}_v(j)$
    Let $uv$ maximize $f_j(u,v)$ subject to $Z_{juv} < f_j(u,v)$
    if such $uv$ exist with $f_j(u,v) > 0$ then
      Assign $j$ to $uv$
      $\lambda_u(j+1) \leftarrow \lambda_u(j) + \frac{p_j(u,v)}{c_u}$
      $\lambda_v(j+1) \leftarrow \lambda_v(j) + \frac{s_j(u,v)}{d_v}$
      For all other $u' \ne u$ let $\lambda_{u'}(j+1) \leftarrow \lambda_{u'}(j)$
      For all other $v' \ne v$ let $\lambda_{v'}(j+1) \leftarrow \lambda_{v'}(j)$
    else
      Reject task $j$
      For all $u$ let $\lambda_u(j+1) \leftarrow \lambda_u(j)$
      For all $v$ let $\lambda_v(j+1) \leftarrow \lambda_v(j)$
    end if
  end for

Proof (of Lemma 8). Note that $\lambda_u(n+1)$ will be $\frac{1}{c_u}\sum_{t,v} p_t(u,v)\, x_{tuv}$, since any time we assign a job $j$ to $uv$ we immediately increase $\lambda_u(j+1)$ by the appropriate amount. Thus if we can prove $\lambda_u(n+1) \le 1$ we will not violate the capacity of $u$. Initially we had $\lambda_u(1) = 0 < 1$, so suppose that the first time we exceed capacity is after the placement of job $j$. Thus we have $\lambda_u(j) \le 1 < \lambda_u(j+1)$. By applying Assumption 2 we have $\lambda_u(j) > 1 - \varepsilon$. From this it follows that $\mathrm{cost}_u(j) > \frac{1}{2}\left(e^{\ln(2F+1)} - 1\right) = F$, and since these costs are always non-negative we must have had $Z_{juv} > \frac{p_j(u,v)}{c_u} F \ge f_j(u,v)$ by applying Assumption 1. But then we must have rejected job $j$ and would have $\lambda_u(j+1) = \lambda_u(j)$, a contradiction. Identical reasoning applies to $v \in V$.

Next, we bound the algorithm's revenue from below using the sum of the node costs.

Lemma 9. Let $A(j)$ be the total objective value $\sum_{t,u,v} x_{tuv} f_t(u,v)$ obtained by the algorithm immediately before job $j$ arrives. Then
\[ (3e\ln(2F+1))\, A(j) \ge \sum_{u \in U} \mathrm{cost}_u(j) + \sum_{v \in V} \mathrm{cost}_v(j). \]

Proof. The proof will be by induction on $j$; the base case where $j = 1$ is immediate since no jobs have yet arrived or been scheduled and $\mathrm{cost}_u(1) = \mathrm{cost}_v(1) = 0$ for all $u$ and $v$. Consider what happens when job $j$ arrives. If this job is rejected, neither side of the inequality changes and the induction holds. Otherwise, suppose job $j$ is assigned to $uv$.
We have:
\[ A(j+1) = A(j) + f_j(u,v). \]
We can bound the new value of the right-hand side by observing that since $\mathrm{cost}_u$ has derivative increasing in the value of $\lambda_u$, the increase in $\mathrm{cost}_u$ will be at most the new derivative times the increase in $\lambda_u$. It follows that:
\[ \mathrm{cost}_u(j+1) \le \mathrm{cost}_u(j) + (\lambda_u(j+1) - \lambda_u(j))\, \frac{1}{2} \left(\frac{\ln(2F+1)}{1-\varepsilon}\right) e^{\lambda_u(j+1)\ln(2F+1)/(1-\varepsilon)} \]
\[ \mathrm{cost}_u(j+1) \le \mathrm{cost}_u(j) + \frac{p_j(u,v)}{c_u} \left(\frac{\ln(2F+1)}{1-\varepsilon}\right) \left(\frac{1}{2} e^{\lambda_u(j)\ln(2F+1)/(1-\varepsilon)}\right) \left(e^{\varepsilon\ln(2F+1)/(1-\varepsilon)}\right) \]
\[ \mathrm{cost}_u(j+1) \le \mathrm{cost}_u(j) + \frac{p_j(u,v)}{c_u}\, \frac{\ln(2F+1)}{1-\varepsilon} \left(\mathrm{cost}_u(j) + \frac{1}{2}\right) \left(e^{\varepsilon\ln(2F+1)}\right) \]
Applying Assumption 2 gives:
\[ \mathrm{cost}_u(j+1) \le \mathrm{cost}_u(j) + (2e\ln(2F+1)) \left(\frac{p_j(u,v)}{c_u}\mathrm{cost}_u(j) + \frac{1}{4}\right) \]
Identical reasoning can be applied to $\mathrm{cost}_v$, allowing us to show that the increase in the right-hand side is at most:
\[ (2e\ln(2F+1)) \left(\frac{p_j(u,v)}{c_u}\mathrm{cost}_u(j) + \frac{s_j(u,v)}{d_v}\mathrm{cost}_v(j) + \frac{1}{2}\right) \]
Since $j$ was assigned to $uv$, we must have $f_j(u,v) > \frac{p_j(u,v)}{c_u}\mathrm{cost}_u(j) + \frac{s_j(u,v)}{d_v}\mathrm{cost}_v(j)$; from Assumption 1 we also have $f_j(u,v) \ge 1$, so we can conclude that the increase in the right-hand side is at most:
\[ (3e\ln(2F+1))\, f_j(u,v) \le (3e\ln(2F+1))\,(A(j+1) - A(j)). \]

Now, we can bound the profit the optimum solution gets from tasks which we either fail to assign, or assign with a lower value of $f_t(u,v)$. The reason we did not assign these tasks was that the node costs were suitably high. Thus, we can bound the profit of tasks using the node costs.

Lemma 10. Suppose the optimum solution assigned $j$ to $u, v$, but the online algorithm either rejected $j$ or assigned it to some $u', v'$ with $f_j(u',v') < f_j(u,v)$. Then
\[ \frac{p_j(u,v)}{c_u}\mathrm{cost}_u(n+1) + \frac{s_j(u,v)}{d_v}\mathrm{cost}_v(n+1) \ge f_j(u,v). \]

Proof. When the algorithm considered $j$, it would find the $u, v$ with maximum $f_j(u,v)$ satisfying $Z_{juv} < f_j(u,v)$. Since the algorithm either could not find such $u, v$ or else selected $u', v'$ with $f_j(u',v') < f_j(u,v)$, it must be that $Z_{juv} \ge f_j(u,v)$.
The lemma then follows by inserting the definition of $Z_{juv}$ and then observing that $\mathrm{cost}_u$ and $\mathrm{cost}_v$ only increase as the algorithm continues.

Lemma 11. Let $Q$ be the total value of tasks which the optimum offline algorithm assigns, but which Algorithm 1 either rejects or assigns to a $uv$ with lower value of $f_t(u,v)$. Then
\[ Q \le \sum_{u \in U} \mathrm{cost}_u(n+1) + \sum_{v \in V} \mathrm{cost}_v(n+1). \]

Proof. Consider any task $q$ as described above. Suppose the offline optimum assigns $q$ to $u_q, v_q$. By applying Lemma 10 we have:
\[ Q = \sum_q f_q(u_q,v_q) \le \sum_q \left( \frac{p_q(u_q,v_q)}{c_{u_q}}\mathrm{cost}_{u_q}(n+1) + \frac{s_q(u_q,v_q)}{d_{v_q}}\mathrm{cost}_{v_q}(n+1) \right). \]
The lemma then follows from the fact that the offline algorithm must obey the capacity constraints.

Finally, we can combine Lemmas 9 and 11 to bound our total profit. In particular, this shows that we are within a factor $3e\ln(2F+1)$ of the optimum offline solution, for an $O(\log F)$-competitive algorithm.

Theorem 11. Algorithm 1 never violates capacity constraints and is $O(\log F)$-competitive.

We can extend the result to $k$-sided placement, and can get a slight improvement in the required assumptions if we are willing to randomize. The results are given below.

Theorem 12. For the $k$-sided placement problem, we can adapt Algorithm 1 to be $O(\log kF)$-competitive provided that Assumption 2 is tightened to $\varepsilon = \min\left(\frac{1}{2}, \frac{1}{\ln(kF+1)}\right)$.

Proof. We must modify the definition of cost to:
\[ \mathrm{cost}_u(j) = \frac{1}{k}\left(e^{\lambda_u(j)\ln(kF+1)/(1-\varepsilon)} - 1\right). \]
The rest of the proof will then go through. The intuition for the increase in competitive ratio is that we need to assign the first task to arrive (otherwise after this task our competitive ratio would be unbounded). This task potentially uses up space on $k$ machines while obtaining a value of only 1. So as the value of $k$ increases, the ratio of "best" to "worst" task effectively increases as well.

Theorem 13.
If we select a random power of two $z \in [1, F]$ and then reject all placements with $f_t(u,v) < z$ or $f_t(u,v) > 2z$, then we can obtain a competitive ratio of $O(\log F \log k)$ while weakening Assumption 2 to $\varepsilon = \min\left(\frac{1}{2}, \frac{1}{\ln(2k+1)}\right)$. Note that in the specific case of two-sided placement this is $O(\log F)$-competitive, requiring only that no single job consumes more than a constant fraction of any machine.

Proof. Once we make our random selection of $z$, we effectively have $F = 2$ and can apply the algorithm and analysis above. The selection of $z$ causes us to lose (in expectation) all but $\frac{1}{\log F}$ of the possible profit, so we have to multiply this into our competitive ratio.

Collaboration in the Cloud at Google

Yunting Sun, Diane Lambert, Makoto Uchida, Nicolas Remy
Google Inc.
January 8, 2014

Abstract

Through a detailed analysis of logs of activity for all Google employees¹, this paper shows how the Google Docs suite (documents, spreadsheets and slides) enables and increases collaboration within Google. In particular, visualization and analysis of the evolution of Google's collaboration network show that new employees² have started collaborating more quickly and with more people as usage of Docs has grown. Over the last two years, the percentage of new employees who collaborate on Docs per month has risen from 70% to 90% and the percentage who collaborate with more than two people has doubled from 35% to 70%. Moreover, the culture of collaboration has become more open, with public sharing within Google overtaking private sharing.

1 Introduction

Google Docs is a cloud productivity suite designed to make collaboration easy and natural, regardless of whether users are in the same or different locations, working at the same or different times, or working on desktops or mobile devices. Edits and comments on the document are displayed as they are made, even if many people are simultaneously writing and commenting on or viewing the document.
Comments enable real-time discussion and feedback on the document, without changing the document itself. Authors are notified when a new comment is made or replied to, and authors can continue a conversation by replying to the comment, end the discussion by resolving it, or restart the discussion by re-opening a closed discussion stream. Because documents are stored in the cloud, users can access any document they own or that has been shared with them anywhere, any time and on any device. The question is whether this enriched model of collaboration matters.

¹ Full-time Google employees, excluding interns, part-timers, vendors, etc.
² Full-time employees who joined Google less than 90 days ago.

There have been a few previous qualitative analyses of the effects of Google Docs on collaboration. For example, the review of Google Docs in [1] suggested that its features should improve collaboration and productivity among college students. A technical report [2] from the University of Southern Queensland, Australia argued that Google Docs can overcome barriers to usability such as difficulty of installation and document version control, and can help resolve conflicts among co-authors of research papers.

There has also been at least one rigorous study of the effect of Google Docs on collaboration. Blau and Caspi [3] ran a small experiment that was designed to compare collaboration on writing documents to merely sharing documents. In their experiment, 118 undergraduate students of the Open University of Israel were randomized to one of five groups in which they shared their written assignments and received feedback from other students to varying degrees, ranging from keeping texts private to allowing in-text suggestions or allowing in-text edits. None of the students had used Google Docs previously.
The authors found that only students in the collaboration group perceived the quality of their final document to be higher after receiving feedback, and students in all groups thought that collaboration improves documents. This paper takes a different approach, and looks for the effects of collaboration on a large, diverse organization with thousands of users over a much longer period of time. The first part of the paper describes some of the contexts in which Google Docs is used for collaboration, and the second part analyzes how collaboration has evolved over the last two years.

2 Collaboration Visualization

2.1 The Data

This section introduces a way to visualize the events during a collaboration and some simple statistics that summarize how widespread collaboration using Google Docs is at Google. The graphics and metrics are based on the view, edit and comment actions of all full-time employees on tens of thousands of documents created in April 2013.

2.2 A Simple Example

To start, a document with three collaborators Adam (A), Bryant (B) and Catherine (C) is shown in Figure 1. The horizontal axis represents time during the collaboration. The vertical axis is broken into three regions representing viewing, editing and commenting. Each contributor is assigned a color. A box with the contributor’s color is drawn in any time interval in which the contributor was active, at a vertical position that indicates what the user was doing in that time interval. This allows us to see when contributors were active and how often they contributed to the document. Stacking the boxes allows us to show when contributors were acting at the same time. Only time intervals in which at least one contributor was active are shown, and gaps in time that are shorter than a threshold are ignored. Gray vertical bars of fixed width are used to represent periods of no activity that are longer than the threshold. In this paper, the threshold is set to be 12 hours in all examples.
In Figure 1, an interval represents an hour. Adam and Bryant edited the document together during the hour of 10 AM on May 4 and Bryant edited alone in the following hour. The collaboration paused for 8 days and resumed during the hour of 2 PM on May 12. Adam, Bryant and Catherine all viewed the document during that hour. Catherine commented on the document in the next hour. Altogether, the collaboration had two active sessions, with a pause of 8 days between them.

Figure 1: This figure shows an example of the collaboration visualization technique. Each colored block except the gray one represents an hour and the gray one represents a period of no activity. The Y axis is the number of users for each action type. This document has three contributors, each assigned a different color.

Although we have used color to represent collaborators here, we could instead use color to represent the locations of the collaborators, their organizations, or other variables. Examples with different colorings are given in Sections 2.5 and 2.6.

2.3 Collaboration Metrics

To estimate the percentage of users who concurrently edit a document and the percentage of documents which had concurrent editing, we discretize the timestamps of editing actions into 15 minute intervals and consider editing actions by different contributors in the same 15 minute interval to be concurrent. Two users who edit the same document but always more than 15 minutes apart would not be considered concurrent editors, although they would still be considered collaborators. Edge cases in which two collaborators edit the same document within 15 minutes of each other but in two adjacent 15 minute intervals would not be counted as concurrent events. The choice of 15 minutes is arbitrary; however, metrics based on a 15 minute discretization and a 5 minute discretization differ little. The choice of 15 minute intervals makes computation faster.
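The discretization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(doc_id, user, timestamp)` record layout and the function name `concurrent_editors` are hypothetical and are not the schema or code used for the logs analyzed in this paper.

```python
from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 15 * 60  # 15 minute discretization, as described in the text


def concurrent_editors(edits, bucket_seconds=BUCKET_SECONDS):
    """Return {doc_id: users who shared an interval with another editor}.

    `edits` is an iterable of (doc_id, user, timestamp) tuples. This record
    layout is hypothetical; the real log schema is not given in the paper.
    """
    buckets = defaultdict(set)  # (doc_id, interval index) -> users editing in it
    for doc_id, user, ts in edits:
        buckets[(doc_id, int(ts.timestamp()) // bucket_seconds)].add(user)

    concurrent = defaultdict(set)
    for (doc_id, _), users in buckets.items():
        if len(users) > 1:  # two or more distinct editors in the same interval
            concurrent[doc_id] |= users
    return dict(concurrent)


utc = timezone.utc
edits = [
    ("doc1", "adam", datetime(2013, 4, 1, 10, 2, tzinfo=utc)),
    ("doc1", "bryant", datetime(2013, 4, 1, 10, 11, tzinfo=utc)),     # same interval
    ("doc2", "adam", datetime(2013, 4, 1, 10, 14, tzinfo=utc)),
    ("doc2", "catherine", datetime(2013, 4, 1, 10, 16, tzinfo=utc)),  # adjacent interval
]
result = concurrent_editors(edits)
print(result)
```

In this toy input the two edits to `doc2` are only two minutes apart but fall in adjacent intervals, so they are not counted as concurrent, which is the edge case noted above.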
A more accurate approach would be to look for sequences of editing actions by different users with gaps below 15 minutes, but that requires considerably more computing.

2.4 Collaborative Editing

Collaborative editing is common at Google. 53% of the documents that were created and shared in April 2013 were edited by more than one employee, and half of those had at least one concurrent editing session in the following six months. Looking at employees instead of documents, 80% of the employees who edited any document contributed content to a document owned by others and 65% participated in at least one 15 minute concurrent editing session in April 2013. Concurrent editing is sticky, in the sense that 76% of the employees who participated in a 15 minute concurrent editing session in April did so again the following month. There are many use cases for collaborative editing, including weekly reports, design documents, and coding interviews. The following three plots show an example of each of these use cases.

Figure 2: Collaboration activity on a design document. The X axis is time in hours and the Y axis is the number of users for each action type. The document was mainly edited by 3 employees, commented on by 18 and viewed by 50+.

Figure 2 shows the life of a design document created by engineers. The X axis is time in hours and the Y axis is the number of employees working on the document for each action type. The document was mainly edited by three employees, commented on by 18 employees and viewed by more than 50 employees from three major locations. This document was completed within two weeks and viewed many times in the subsequent month. Design documents are common at Google, and they typically have many contributors. Figure 3 shows the life of a weekly report document. Each bar represents a day and the Y axis is the number of employees who edited and viewed the document in a day.
This document has the following submission rules:
• Wednesday, AM: Reminder for submissions
• Wednesday, PM: All teams submit updates
• Thursday, AM: Document is locked
The activities on the document exhibit a pronounced weekly pattern that mirrors the submission rules. Weekly reports and meeting notes that are updated regularly are often used by employees to keep everyone up-to-date as projects progress.

Figure 3: Collaboration on a weekly report. The X axis is time in days and the Y axis is the number of users for each action type. The activities exhibit a pronounced weekly pattern and reflect the submission rules of the document.

Finally, Figure 4 shows the life of a document used in an interview. The X axis represents time in minutes. The document was prepared by a recruiter and then viewed by an engineer. At the beginning of the interview, the engineer edited the document and the candidate then wrote code in the document. The engineer was able to watch the candidate typing. At the end of the interview, the candidate’s access to the document was revoked so no further change could be made, and the document was reviewed by the engineer. Collaborative editing allows the coding interview to take place remotely, and it is an integral part of interviews for software engineers at Google.

Figure 4: The activity on a phone interview document. The X axis is time in minutes and the Y axis is the number of users for each action type. The engineer was able to watch the candidate typing on the document during a remote interview.

2.5 Commenting

Commenting is common at Google. 30% of the shared documents created in April 2013 received comments within six months of creation. 57% of the employees who used Google Docs in April commented at least once in April, and 80% of the users who commented in April commented again in the following month.

Figure 5: Commenting and editing on a design document.
The X axis is time in hours and the Y axis is the number of user actions for each user location. There are four user actions, each assigned a different color. Timestamps are in Pacific time.

Figure 5 shows the life of a design document. Here color represents the type of user action (create a comment, reply to a comment, resolve a comment and edit the document), and the Y axis is split into two locations. The document was written by one engineering team and reviewed by another. The review team used commenting to raise many questions, which the engineering team resolved over the next few days. Collaborators were located in London, UK and Mountain View, California, with a nine hour time zone difference, so the two teams were almost "taking turns" working on the document (timestamps are in Pacific time). There are many similar communication patterns in which engineers use commenting to ask questions, have discussions and suggest modifications.

2.6 Collaboration Across Sites

Employees use the Docs suite to collaborate with colleagues across the world, as Figure 6 shows. In that figure, employees working from nine locations in eight countries across the globe contributed to a document that was written within a week. The document was either viewed or edited with gaps of less than 12 hours (the threshold for suppressing gaps in the plot) in the first seven days as people worked in their local timezones. After final changes were made to the document, it was reviewed by people in Dublin, Mountain View, and New York. Figure 7 shows one month of global collaborations for full-time employees using Google Docs. The blue dots show the locations of the employees and a line connects two locations if a document is created in one location and viewed in the other. The warmer the color of the line, moving from green to red, the more documents are shared between the two locations.

Figure 6: Activity on a document.
Each user location is assigned a different color. The X axis is time in hours and the Y axis is the number of locations for each action type. Users from nine different locations contributed to the document.

Figure 7: Global collaboration on Docs. The blue dots are locations and the dots are connected if there is collaboration on Google Docs between the two locations.

2.7 Cross Device Work

The advantage of cloud-based software and storage is that a document can be accessed from any device. Figure 8 shows one employee’s visits to a document from multiple devices and locations. When the employee was in Paris, a desktop or laptop was used during working hours and a mobile device during non-working hours. Apparently, the employee traveled to Aix-en-Provence on August 18. On August 18 and the first part of August 19, the employee continued working on the same document from a mobile device while on the move.

Figure 8: Visits to a document by one user working on multiple devices and from multiple locations.

Not surprisingly, the pattern of working on desktops or laptops during working hours and on mobile devices outside business hours holds generally at Google, as Figure 9 shows. The day of week is shown on the X axis and hour of day in local time on the Y axis. Each pixel is colored according to the average number of employees working in Google Docs in a day of week and time of day slot, with brighter colors representing higher numbers. Pixel values are normalized within each plot separately. Desktop and laptop usage of Google Docs peaks during conventional working hours (9:00 AM to 11:00 AM and 1:00 PM to 5:00 PM), while mobile device usage peaks during conventional commuting and other out-of-office hours (7:00 AM to 9:00 AM and 6:00 PM to 8:00 PM).

Figure 9: The average number of active users working in Google Docs in each day of week and time of day slot.
The X axis is day of the week and the Y axis is time of the day in local time. Desktop/Laptop usage peaks during working hours while mobile usage peaks during out-of-office hours.

3 The Evolution of Collaboration

3.1 The Data

This section explores changes in the usage of Google Docs over time. Section 2 defined collaborators as users who edited or commented on the same document and used logs of employee editing, viewing and commenting actions to describe collaboration within Google. This section defines collaborators differently using metadata on documents. Metadata is much less rich than the event history logs used in Section 2, but metadata is retained for a much longer period of time. Document metadata includes the document creation time and the last time that the document was accessed, but no other information about its revision history. However, the metadata does include the identification numbers for employees who have subscribed to the document, where a subscriber is anyone who has permission to view, edit or comment on a document and who has viewed the document at least once. Here we use metadata on documents, slides and spreadsheets. We call two employees collaborators (or subscription collaborators to be clear) if one is a subscriber to a document owned by the other and has viewed the document at least once and the document has fewer than 20 subscribers. The owner of the document is said to have shared the document with the subscriber. The number of subscribers is capped at 20 to avoid overcounting collaborators. The more subscribers the document has, the less likely it is that all the subscribers contributed to the document. There is no timestamp for when the employee subscribed to the document in the metadata, so the exact time of the collaboration is not known. Instead, the document creation time, which is known, is taken to be the time of the collaboration.
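The definition of subscription collaborators above can be illustrated with a short Python sketch. The dictionary keys (`owner`, `subscribers`, `created`) and the function name are hypothetical, and the choice not to count the owner among the subscribers is an assumption; the paper does not describe the metadata schema at this level of detail.

```python
from collections import defaultdict
from datetime import date

MAX_SUBSCRIBERS = 20  # only documents with fewer than 20 subscribers count


def subscription_collaborators(documents):
    """Map (year, month) of document creation -> set of collaborator pairs.

    `documents` is an iterable of dicts with hypothetical keys "owner",
    "subscribers" (ids of users who viewed the document at least once) and
    "created" (a datetime.date). Excluding the owner from the subscriber
    count is an assumption made for this sketch.
    """
    pairs_by_month = defaultdict(set)
    for doc in documents:
        subscribers = set(doc["subscribers"]) - {doc["owner"]}
        if len(subscribers) >= MAX_SUBSCRIBERS:
            continue  # too many subscribers to treat them all as collaborators
        # The creation month stands in for the unknown time of collaboration.
        month = (doc["created"].year, doc["created"].month)
        for user in subscribers:
            pairs_by_month[month].add(frozenset((doc["owner"], user)))
    return dict(pairs_by_month)


documents = [
    {"owner": "ann", "subscribers": ["bob", "cy"], "created": date(2011, 1, 5)},
    {"owner": "bob", "subscribers": [f"u{i}" for i in range(25)], "created": date(2011, 1, 9)},
    {"owner": "cy", "subscribers": ["ann"], "created": date(2011, 2, 1)},
]
pairs = subscription_collaborators(documents)
print(pairs)
```

The second document has 25 subscribers and is skipped, so it contributes no collaborator pairs.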
An analysis (not shown here) of the event history data discussed in Section 2 showed that most collaborators join a collaboration soon after a document is created, so taking collaboration time to be document creation time is not unreasonable. To make this assumption even more tenable, we exclude documents for which the time of the last view, comment or edit is more than six months after the document was created. This section uses metadata on documents created between January 1, 2011 and March 31, 2013. We say that two employees had a subscription collaboration in July if they collaborated on a document that was created in July.

3.2 Collaboration for New Employees

Here we define the new employees for a given month to be all the employees who joined Google no more than 90 days before the beginning of the month and started using Google Docs in the given month. For example, employees called new in the month of January 2011 must have joined Google no more than 90 days before January 1, 2011 and used Google Docs in January 2011. Each month can include different employees. New employees are said to share a document if they own a document that someone else subscribed to, whether or not the person subscribed to the document is a new employee. Similarly, a new employee is counted as a subscriber, regardless of the tenure of the document creator. Figure 10 shows that collaboration among new employees has increased since 2011. Over the last two years, subscribing has risen from 55% to 85%, sharing has risen from 30% to 50%, and the fraction of users who either share or subscribe has risen from 70% to 90%. In other words, new employees are collaborating earlier in their career, so there is a faster ramp-up and easier access to collective knowledge.

Figure 10: This figure shows the percentage of new employees who share, subscribe to others’ documents and either share or subscribe in each one-month period over the last two years.
Not only do new employees start collaborating more often (as measured by subscription and sharing), they also collaborate with more people. Figure 11 shows the percentage of new employees with at least a given number of collaborators by month. For example, the percentage of new employees with at least three subscription collaborators was 35% in January 2011 (the bottom red curve) and 70% in March 2013 (the top blue curve), a doubling over two years. It is interesting that the curves hardly cross each other and the curves for the earliest months lie below those for recent months, suggesting that there has been steady growth in the number of subscription collaborators per new employee over this period.

Figure 11: This figure shows the proportion of new employees who have at least a given number of collaborators in each one-month period. Each period is assigned a different color. The cooler the color of the curve, moving from red to blue, the more recent the month. The legend only shows the labels for a subset of curves. The percentage of new employees who have at least three collaborators has doubled from 35% to 70%.

To present the data in Figure 11 in another way, Table 1 shows percentiles of the distribution of the number of subscription collaborators per new employee using Google Docs in January 2011 and in January 2013. For example, the lowest 25% of new employees using Google Docs had no such collaborators in January 2011 and at most two such collaborators in January 2013.

Percentile      25%   50%   75%   90%   95%
January 2011      0     1     4     7    11
January 2013      2     5    10    17    22

Table 1: This table shows percentiles of the number of collaborators a new employee has in January 2011 and January 2013. The entire distribution shifts to the right.

3.3 Collaboration in Sales and Marketing

Section 3.2 compared new employees who joined Google in different months.
This section follows current employees in Sales and Marketing who joined Google before January 1, 2011. That is, the previous section considered changes in new employee behavior over time and this section considers changes in behavior for a fixed set of employees over time. We only analyze subscription collaborations among this fixed set of employees; collaborations with employees not in this set are excluded.

Figure 12: This figure shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators in each one-month period.

Figure 12 shows the percentage of current employees in Sales and Marketing who have at least a given number of collaborators at several times in the past. There we see that more employees are sharing and subscribing over time because the fraction of the group with at least one subscription collaborator has increased from 80% to 95%, and the fraction of the group with at least three subscription collaborators has increased from 50% to 80%. This shows that many of the employees who used to have no or very few subscription collaborators have migrated to having multiple subscription collaborators. In other words, the distribution of number of subscription collaborators for employees who have been in Sales and Marketing since January 1, 2011 has shifted right over time, which implies that collaboration in that group of employees has increased over time. Finally, the number of documents shared by the employees who have been in Sales and Marketing at Google since January 1, 2011 has nearly doubled over the last two years. Figure 13 shows the number of shared documents normalized by the number of shared documents in January 2011.

Figure 13: This figure shows the number of shared documents created by employees in Sales and Marketing each month normalized by the number of shared documents in January 2011.
The number has almost doubled over the last two years.

3.4 Collaboration Between Organizations

Collaboration between organizations has increased over time. To show that, we consider hundreds of employees in nine teams within the Sales and Marketing group and the Engineering and Product Management group who joined Google before January 1, 2011, were still active on March 31, 2013 and used Google Docs in that period. Figure 14 represents the Engineering and Product Management employees as red dots and the Sales and Marketing employees as blue dots. The same dots are included in all three plots in Figure 14 because the employees included in this analysis do not change. A line connects two dots if the two employees had at least one subscription collaboration in the month shown. The denser the lines in the graph, the more collaboration, and the more lines connecting red and blue dots, the more collaboration between organizations. Clearly, subscription collaboration has increased both within and across organizations in the past two years. Moreover, the network shows more pronounced communities (groups of connected dots) over time. Although there are nine individual teams, there seem to be only three major communities in the network. Figure 14 indicates that teams can work closely with each other even though they belong to separate departments. We also sampled 187 teams within the Sales and Marketing group and the Engineering and Product Management group. Figure 15 represents teams in Engineering and Product Management as red dots and teams in Sales and Marketing as blue dots. Two dots are connected if the two teams had at least one subscription collaboration between their members in the month. Figure 15 shows that the collaboration between those teams has increased and the interaction between the two organizations has become stronger over the past two years.
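The density of lines within and between the two groups of dots can be summarized by counting edges. The following Python sketch assumes that each month's collaborations are available as a set of unordered employee pairs and that a lookup from employee to organization exists; both data structures, the function name and the toy values are hypothetical and do not come from the paper.

```python
def edge_counts(pairs, org_of):
    """Count collaboration edges within and across organizations.

    `pairs` is a set of frozensets, each holding two distinct employee ids;
    `org_of` maps an employee id to an organization label. Both are
    hypothetical structures chosen for this illustration.
    """
    within = across = 0
    for pair in pairs:
        a, b = tuple(pair)
        if org_of[a] == org_of[b]:
            within += 1
        else:
            across += 1
    return within, across


org_of = {"e1": "Eng", "e2": "Eng", "s1": "Sales", "s2": "Sales"}
early_month = {frozenset(("e1", "e2"))}
later_month = {
    frozenset(("e1", "e2")),
    frozenset(("s1", "s2")),
    frozenset(("e1", "s1")),  # edge between the two organizations
    frozenset(("e2", "s2")),  # edge between the two organizations
}
early = edge_counts(early_month, org_of)
later = edge_counts(later_month, org_of)
print(early, later)
```

Comparing the two counts month by month gives a numeric counterpart to the visual impression of denser lines, without replacing the network plots.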
Figure 14: An example of collaboration across organizations. Red dots represent employees in Engineering and Product Management and blue dots represent employees in Sales and Marketing.

Figure 15: An example of collaboration between teams. Red dots represent teams in Engineering and Product Management and blue dots represent teams in Sales and Marketing.

3.5 Cultural Changes in Collaboration

Google Docs allows users to specify the access level (visibility) of their documents. The default access level in Google Docs is private, which means that only the user who created the document or the current owner of the document can view it. Employees can change the access level on a document they own and allow more people to access it. For example, the document owner can specify particular employees who are allowed to access the document, or the owner can mark the document as public within Google, in which case any employee can access the document. Clearly, not all documents created in Google can be visible to everyone at Google, but the more documents are widely shared, the more open the environment is to collaboration.

Figure 16: This figure shows the percentage of shared documents that are "public within Google" created in each month. Public sharing is overtaking private sharing at Google.

Figure 16 shows the percentage of shared documents in Google created each month between January 1, 2012 and March 31, 2013 that are public within Google. The red line, which is a curve fit to the data to smooth out variability, shows that the percentage has increased by about 12% in relative terms, from 48% to 54%, in the last year alone. In that sense, the culture of sharing is changing in Google from private sharing to public sharing.
4 Conclusions

We have examined how Google employees collaborate with Docs and how that collaboration has evolved using logs of user activity and document metadata. To show the current usage of Docs in Google, we have developed a visualization technique for the revision history of a document and analyzed key features in Docs such as collaborative editing, commenting, and access from anywhere and on any device. To show the evolution of collaboration in the cloud, we have analyzed new employees and a fixed group of employees in Sales and Marketing, and computed collaboration network statistics each month. We find that employees are engaged in using the Docs suite, and collaboration has grown rapidly over the last two years. It would also be interesting to conduct a similar analysis for other enterprises and see how long it would take them to reach the benchmark Google has set for collaboration on Docs. Not only has the collaboration on Docs changed at Google, but the numbers of emails, comments on G+, and calendar meetings between people who work together have also changed significantly over the past few years. How those changes reinforce each other over time would also be an interesting topic to study.

Acknowledgements

We would like to thank Ariel Kern for her insights about collaboration on Google Docs, Penny Chu and Tony Fagan for their encouragement and support, and Jim Koehler for his constructive feedback.

References

[1] Dan R. Herrick (2009). Google this!: using Google apps for collaboration and productivity. Proceedings of the ACM SIGUCCS fall conference (pp. 55-64).
[2] Stijn Dekeyser, Richard Watson (2009). Extending Google Docs to Collaborate on Research Papers. Technical Report, The University of Southern Queensland, Australia.
[3] Ina Blau, Avner Caspi (2009). What Type of Collaboration Helps? Psychological Ownership, Perceived Learning and Outcome Quality of Collaboration Using Google Docs.
Learning in the technological era: Proceedings of the Chais conference on instructional technologies research (pp. 48-55).

Circulant Binary Embedding

Felix X. Yu¹ YUXINNAN@EE.COLUMBIA.EDU
Sanjiv Kumar² SANJIVK@GOOGLE.COM
Yunchao Gong³ YUNCHAO@CS.UNC.EDU
Shih-Fu Chang¹ SFCHANG@EE.COLUMBIA.EDU
¹ Columbia University, New York, NY 10027
² Google Research, New York, NY 10011
³ University of North Carolina at Chapel Hill, Chapel Hill, NC 27599

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

Binary embedding of high-dimensional data requires long codes to preserve the discriminative power of the input space. Traditional binary coding methods often suffer from very high computation and storage costs in such a scenario. To address this problem, we propose Circulant Binary Embedding (CBE) which generates binary codes by projecting the data with a circulant matrix. The circulant structure enables the use of Fast Fourier Transformation to speed up the computation. Compared to methods that use unstructured matrices, the proposed method improves the time complexity from $O(d^2)$ to $O(d \log d)$, and the space complexity from $O(d^2)$ to $O(d)$ where $d$ is the input dimensionality. We also propose a novel time-frequency alternating optimization to learn data-dependent circulant projections, which alternately minimizes the objective in original and Fourier domains. We show by extensive experiments that the proposed approach gives much better performance than the state-of-the-art approaches for fixed time, and provides much faster computation with no performance degradation for fixed number of bits.

1. Introduction

Embedding input data in binary spaces is becoming popular for efficient retrieval and learning on massive data sets (Li et al., 2011; Gong et al., 2013a; Raginsky & Lazebnik, 2009; Gong et al., 2012; Liu et al., 2011). Moreover, in a large number of application domains such as computer vision, biology and finance, data is typically high-dimensional. When representing such high-dimensional data by binary codes, it has been shown that long codes are required in order to achieve good performance. In fact, the required number of bits is $O(d)$, where $d$ is the input dimensionality (Li et al., 2011; Gong et al., 2013a; Sánchez & Perronnin, 2011). The goal of binary embedding is to approximate the input distance well by Hamming distance so that efficient learning and retrieval can happen directly in the binary space. It is important to note that another related area called hashing is a special case with a slightly different goal: creating hash tables such that points that are similar fall in the same (or nearby) bucket with high probability. In fact, even in hashing, if high accuracy is desired, one typically needs to use hundreds of hash tables involving tens of thousands of bits. Most of the existing linear binary coding approaches generate the binary code by applying a projection matrix, followed by a binarization step. Formally, given a data point $x \in \mathbb{R}^d$, the $k$-bit binary code $h(x) \in \{+1, -1\}^k$ is generated simply as

$$h(x) = \operatorname{sign}(Rx), \tag{1}$$

where $R \in \mathbb{R}^{k \times d}$, and $\operatorname{sign}(\cdot)$ is a binary map which returns the element-wise sign¹. Several techniques have been proposed to generate the projection matrix randomly without taking into account the input data (Charikar, 2002; Raginsky & Lazebnik, 2009). These methods are very popular due to their simplicity but often fail to give the best performance due to their inability to adapt the codes with respect to the input data.
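As a rough illustration of the data-independent case, the sketch below draws the rows of a projection matrix from a standard Gaussian, computes sign-based codes, and compares the normalized Hamming distance of two codes with θ/π, the per-bit disagreement probability for Gaussian sign projections (Charikar, 2002), where θ is the angle between the inputs. The function names, the dimensions, and the convention of mapping a projection of exactly zero to +1 are assumptions made for this example only.

```python
import math
import random


def random_binary_code(x, projections):
    """One bit per row of the projection matrix: the sign of the inner product.

    A projection of exactly 0 is mapped to +1 here; that tie-breaking rule is
    an assumption of this sketch.
    """
    return [1 if sum(r_j * x_j for r_j, x_j in zip(row, x)) >= 0 else -1
            for row in projections]


def normalized_hamming(code_a, code_b):
    """Fraction of positions in which two equal-length codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b)) / len(code_a)


rng = random.Random(0)
d, k = 8, 4096  # toy sizes, chosen only to make the estimate stable
R = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

theta = math.pi / 3  # angle between the two test points
x1 = [1.0] + [0.0] * (d - 1)
x2 = [math.cos(theta), math.sin(theta)] + [0.0] * (d - 2)

estimate = normalized_hamming(random_binary_code(x1, R), random_binary_code(x2, R))
print(estimate, theta / math.pi)  # the two values should be close
```

Storing and applying the dense matrix `R` costs on the order of k·d numbers and operations, which is the cost that structured projections aim to avoid.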
Thus, a number of data-dependent techniques have been proposed with different optimization criteria such as reconstruction error (Kulis & Darrell, 2009), data dissimilarity (Norouzi & Fleet, 2012; Weiss et al., 2008), ranking loss (Norouzi et al., 2012), quantization error after PCA (Gong et al., 2013b), and pairwise misclassification (Wang et al., 2010). These methods are shown to be effective for learning compact codes for relatively low-dimensional data. However, the $O(d^2)$ computational and space costs prohibit them from being applied to learning long codes for high-dimensional data. For instance, to generate $O(d)$-bit binary codes for data with $d \sim 1$M, a huge projection matrix will be required needing TBs of memory, which is not practical².

¹ A few methods transform the linear projection via a nonlinear map before taking the sign (Weiss et al., 2008; Raginsky & Lazebnik, 2009).

In order to overcome these computational challenges, Gong et al. (2013a) proposed a bilinear projection based coding method for high-dimensional data. It reshapes the input vector $x$ into a matrix $Z$, and applies a bilinear projection to get the binary code:

$$h(x) = \operatorname{sign}(R_1^T Z R_2). \tag{2}$$

When the shapes of $Z$, $R_1$, $R_2$ are chosen appropriately, the method has time and space complexity of $O(d^{1.5})$ and $O(d)$, respectively. Bilinear codes make it feasible to work with datasets with very high dimensionality and have shown good results in a variety of tasks. In this work, we propose a novel Circulant Binary Embedding (CBE) technique which is even faster than the bilinear coding. It is achieved by imposing a circulant structure on the projection matrix $R$ in (1). This special structure allows us to use Fast Fourier Transformation (FFT) based techniques, which have been extensively used in signal processing. The proposed method further reduces the time complexity to $O(d \log d)$, enabling efficient binary embedding for very high-dimensional data³.
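The speed-up from the circulant structure rests on a standard fact: multiplying a vector by a circulant matrix is a circular convolution, which can be computed with FFTs. The NumPy sketch below checks this numerically on a toy example. The helper names `circulant` and `circulant_binary_code` are hypothetical, the matrix is taken to have the defining vector as its first column with each later column a cyclic shift of it, and mapping a projection of exactly zero to +1 is an assumption.

```python
import numpy as np


def circulant(r):
    """Dense d x d circulant matrix whose first column is r (O(d^2) storage)."""
    d = len(r)
    return np.array([[r[(i - j) % d] for j in range(d)] for i in range(d)])


def circulant_binary_code(r, x):
    """Signs of circulant(r) @ x, computed with FFTs in O(d log d) time."""
    projected = np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real
    return np.where(projected >= 0, 1, -1)  # zero mapped to +1 by assumption


rng = np.random.default_rng(0)
d = 16  # toy dimension
r = rng.standard_normal(d)
x = rng.standard_normal(d)

dense = circulant(r) @ x                                 # explicit O(d^2) product
fast = np.fft.ifft(np.fft.fft(r) * np.fft.fft(x)).real   # FFT-based product
code = circulant_binary_code(r, x)
print(np.allclose(dense, fast), code)
```

Only the vector `r` needs to be stored in the FFT path; the dense matrix is built here purely as a reference to check the result.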
Table 1 compares the time and space complexity for different methods. This work makes the following contributions:

• We propose the circulant binary embedding method, which has space complexity $O(d)$ and time complexity $O(d \log d)$ (Sections 2 and 3).
• We propose to learn the data-dependent circulant projection matrix by a novel and efficient time-frequency alternating optimization, which alternately optimizes the objective in the original and frequency domains (Section 4).
• Extensive experiments show that, compared to the state-of-the-art, the proposed method improves the result dramatically for a fixed time cost, and provides much faster computation with no performance degradation for a fixed number of bits (Section 5).

² In principle, one can generate the random entries of the matrix on-the-fly (with fixed seeds) without needing to store the matrix. But this will increase the computational time even further.

³ One could in principle use other structured matrices, such as a Hadamard matrix along with a sparse random Gaussian matrix, to achieve fast projection, as was done in the fast Johnson-Lindenstrauss transform (Ailon & Chazelle, 2006; Dasgupta et al., 2011), but it is still slower than circulant projection and needs more space.

Method              Time          Space     Time (Learning)
Full projection     O(d^2)        O(d^2)    O(n d^3)
Bilinear proj.      O(d^1.5)      O(d)      O(n d^1.5)
Circulant proj.     O(d log d)    O(d)      O(n d log d)

Table 1. Comparison of the proposed method (Circulant proj.) with other methods for generating long codes (code dimension k comparable to input dimension d). n is the number of instances used for learning data-dependent projection matrices.

2. Circulant Binary Embedding (CBE)

A circulant matrix $R \in \mathbb{R}^{d \times d}$ is a matrix defined by a vector $r = (r_0, r_1, \cdots, r_{d-1})^T$ (Gray, 2006).⁴

$$R = \operatorname{circ}(r) := \begin{bmatrix} r_0 & r_{d-1} & \dots & r_2 & r_1 \\ r_1 & r_0 & r_{d-1} & & r_2 \\ \vdots & r_1 & r_0 & \ddots & \vdots \\ r_{d-2} & & \ddots & \ddots & r_{d-1} \\ r_{d-1} & r_{d-2} & \dots & r_1 & r_0 \end{bmatrix}. \tag{3}$$

⁴ The circulant matrix is sometimes equivalently defined by "circulating" the rows instead of the columns.

Let $D$ be a diagonal matrix with each diagonal entry being a Bernoulli variable ($\pm 1$ with probability 1/2). For $x \in \mathbb{R}^d$, its d-bit Circulant Binary Embedding (CBE) with $r \in \mathbb{R}^d$ is defined as:

$$h(x) = \operatorname{sign}(RDx), \tag{4}$$

where $R = \operatorname{circ}(r)$. The k-bit ($k < d$) CBE is defined as the first k elements of $h(x)$. The need for such a $D$ is discussed in Section 3. Note that applying $D$ to $x$ is equivalent to applying random sign flipping to each dimension of $x$. Since sign flipping can be carried out as a preprocessing step for each input $x$, from here onwards we drop explicit mention of $D$ for simplicity. Hence the binary code is given as $h(x) = \operatorname{sign}(Rx)$.

The main advantage of circulant binary embedding is its ability to use Fast Fourier Transformation (FFT) to speed up the computation.

Proposition 1. For d-dimensional data, CBE has space complexity $O(d)$, and time complexity $O(d \log d)$.

Since a circulant matrix is defined by a single column/row, clearly the storage needed is $O(d)$. Given a data point $x$, the d-bit CBE can be efficiently computed as follows. Denote $\circledast$ as the operator of circulant convolution. Based on the definition of circulant matrix,

$$Rx = r \circledast x. \tag{5}$$

The above can be computed based on the Discrete Fourier Transformation (DFT), for which a fast algorithm (FFT) is available. The DFT of a vector $t \in \mathbb{C}^d$ is a d-dimensional vector with each element defined as

$$\mathcal{F}(t)_l = \sum_{m=0}^{d-1} t_m \cdot e^{-i 2\pi l m / d}, \quad l = 0, \cdots, d-1. \tag{6}$$

The above can be expressed equivalently in a matrix form as

$$\mathcal{F}(t) = F_d t, \tag{7}$$

where $F_d$ is the d-dimensional DFT matrix. Let $F_d^H$ be the conjugate transpose of $F_d$. It is easy to show that $F_d^{-1} = (1/d) F_d^H$. Similarly, for any $t \in \mathbb{C}^d$, the Inverse Discrete Fourier Transformation (IDFT) is defined as

$$\mathcal{F}^{-1}(t) = (1/d) F_d^H t. \tag{8}$$

Since the convolution of two signals in their original domain is equivalent to the Hadamard product in their frequency domain (Oppenheim et al., 1999),

$$\mathcal{F}(Rx) = \mathcal{F}(r) \circ \mathcal{F}(x). \tag{9}$$

Therefore,

$$h(x) = \operatorname{sign}\left(\mathcal{F}^{-1}(\mathcal{F}(r) \circ \mathcal{F}(x))\right). \tag{10}$$

For k-bit CBE, $k < d$, we only need to pick the first k bits of $h(x)$. As DFT and IDFT can be efficiently computed in $O(d \log d)$ with FFT (Oppenheim et al., 1999), generating CBE has time complexity $O(d \log d)$.

3. Randomized Circulant Binary Embedding

A simple way to obtain CBE is by generating the elements of $r$ in (3) independently from the standard normal distribution $\mathcal{N}(0, 1)$. We call this method randomized CBE (CBE-rand). A desirable property of any embedding method is its ability to approximate input distances in the embedded space. Suppose $H_k(x_1, x_2)$ is the normalized Hamming distance between k-bit codes of a pair of points $x_1, x_2 \in \mathbb{R}^d$:

$$H_k(x_1, x_2) = \frac{1}{k} \sum_{i=0}^{k-1} \left| \operatorname{sign}(R_{i\cdot} x_1) - \operatorname{sign}(R_{i\cdot} x_2) \right| / 2, \tag{11}$$

and $R_{i\cdot}$ is the i-th row of $R$, $R = \operatorname{circ}(r)$. If $r$ is sampled from $\mathcal{N}(0, 1)$, from (Charikar, 2002),

$$\Pr\left[\operatorname{sign}(r^T x_1) \neq \operatorname{sign}(r^T x_2)\right] = \theta / \pi, \tag{12}$$

where $\theta$ is the angle between $x_1$ and $x_2$. Since all the vectors that are circulant variants of $r$ also follow the same distribution, it is easy to see that

$$E(H_k(x_1, x_2)) = \theta / \pi. \tag{13}$$

For the sake of discussion, if k projections, i.e., the first k rows of $R$, were generated independently, it is easy to show that the variance of $H_k(x_1, x_2)$ will be

$$\operatorname{Var}(H_k(x_1, x_2)) = \theta(\pi - \theta) / (k \pi^2). \tag{14}$$

Thus, with more bits (larger k), the normalized Hamming distance will be close to the expected value, with lower variance. In other words, the normalized Hamming distance approximately preserves the angle.⁵ Unfortunately in CBE, the projections are the rows of $R = \operatorname{circ}(r)$, which are not independent. This makes it hard to derive the variance analytically. To better understand CBE-rand, we run simulations to compare the analytical variance of normalized Hamming distance of independent projections (14), and the sample variance of normalized Hamming distance of circulant projections in Figure 1. For each $\theta$ and k, we randomly generate $x_1, x_2 \in \mathbb{R}^d$ such that their angle is $\theta$.⁶ We then generate the k-dimensional code with CBE-rand, and compute the Hamming distance. The variance is estimated by applying CBE-rand 1,000 times. We repeat the whole process 1,000 times, and compute the averaged variance. Surprisingly, the curves of "Independent" and "Circulant" variances are almost indistinguishable. This means that bits generated by CBE-rand are generally as good as the independent bits for angle preservation. An intuitive explanation is that for the circulant matrix, though all the rows are dependent, circulant shifting by one or multiple elements will in fact result in very different projections in most cases. We will later show in experiments on real-world data that CBE-rand and Locality Sensitive Hashing (LSH)⁷ have almost identical performance (yet CBE-rand is significantly faster) (Section 5).

[Figure 1: four panels, (a) θ = π/12, (b) θ = π/6, (c) θ = π/3, (d) θ = π/2, each plotting variance against log k for the "Independent" and "Circulant" curves.]

Figure 1. The analytical variance of normalized Hamming distance of independent bits as in (14), and the sample variance of normalized Hamming distance of circulant bits, as a function of angle between points (θ) and number of bits (k). The two curves overlap.

⁵ In this paper, we consider the case that the data points are $\ell_2$ normalized. Therefore the cosine distance, i.e., $1 - \cos(\theta)$, is equivalent to the $\ell_2$ distance.
⁶ This can be achieved by extending the 2D points $(1, 0)$, $(\cos\theta, \sin\theta)$ to d dimensions, and performing a random orthonormal rotation, which can be formed by the Gram-Schmidt process on random vectors.

⁷ Here, by LSH we imply the binary embedding using $R$ such that all the rows of $R$ are sampled iid. With slight abuse of notation, we still call it "hashing" following (Charikar, 2002).

Note that the distortion in input distances after circulant binary embedding comes from two sources: circulant projection, and binarization. For the circulant projection step, recent works have shown that the Johnson-Lindenstrauss-type lemma holds with a slightly worse bound on the number of projections needed to preserve the input distances with high probability (Hinrichs & Vybíral, 2011; Zhang & Cheng, 2013; Vybíral, 2011; Krahmer & Ward, 2011). These works also show that before applying the circulant projection, an additional step of randomly flipping the signs of input dimensions is necessary.⁸ To show why such a step is required, let us consider the special case when $x$ is an all-one vector, $\mathbf{1}$. The circulant projection with $R = \operatorname{circ}(r)$ will result in a vector with all elements equal to $r^T \mathbf{1}$. When $r$ is independently drawn from $\mathcal{N}(0, 1)$, this will be close to 0, and the norm cannot be preserved. Unfortunately the Johnson-Lindenstrauss-type results do not generalize to the distortion caused by the binarization step.

One problem with the randomized CBE method is that it does not utilize the underlying data distribution while generating the matrix $R$. In the next section, we propose to learn $R$ in a data-dependent fashion, to minimize the distortions due to circulant projection and binarization.

4. Learning Circulant Binary Embedding

We propose data-dependent CBE (CBE-opt), by optimizing the projection matrix with a novel time-frequency alternating optimization. We consider the following objective function in learning the d-bit CBE.
The extension of learning $k < d$ bits will be shown in Section 4.2.

$$\operatorname*{argmin}_{B, r} \; \|B - XR^T\|_F^2 + \lambda \|RR^T - I\|_F^2 \quad \text{s.t. } R = \operatorname{circ}(r), \tag{15}$$

where $X \in \mathbb{R}^{n \times d}$ is the data matrix containing n training points: $X = [x_0, \cdots, x_{n-1}]^T$, and $B \in \{-1, 1\}^{n \times d}$ is the corresponding binary code matrix.⁹

In the above optimization, the first term minimizes distortion due to binarization. The second term tries to make the projections (rows of $R$, and hence the corresponding bits) as uncorrelated as possible. In other words, this helps to reduce the redundancy in the learned code. If $R$ were an orthogonal matrix, the second term would vanish and the optimization would find the best rotation such that the distortion due to binarization is minimized. However, when $R$ is a circulant matrix, $R$, in general, will not be orthogonal. A similar objective has been used in previous works including (Gong et al., 2013b;a) and (Wang et al., 2010).

⁸ For each dimension, whether the sign needs to be flipped is predetermined by a (p = 0.5) Bernoulli variable.

⁹ If the data is $\ell_2$ normalized, we can set $B \in \{-1/\sqrt{d}, 1/\sqrt{d}\}^{n \times d}$ to make $B$ and $XR^T$ more comparable. This does not empirically influence the performance.

4.1. The Time-Frequency Alternating Optimization

The above is a combinatorial optimization problem, for which an optimal solution is hard to find. In this section we propose a novel approach to efficiently find a local solution. The idea is to alternately optimize the objective by fixing $r$ and $B$, respectively. For a fixed $r$, optimizing $B$ can be easily performed in the input domain ("time" as opposed to "frequency"). For a fixed $B$, the circulant structure of $R$ makes it difficult to optimize the objective in the input domain. Hence we propose a novel method, by optimizing $r$ in the frequency domain based on DFT. This leads to a very efficient procedure.

For a fixed r. The objective decomposes over the elements of $B$, so each element can be optimized independently. Denote $B_{ij}$ as the element of the i-th row and j-th column of $B$.
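It is easy to show that $B$ can be updated as:

$$B_{ij} = \begin{cases} 1 & \text{if } R_{j\cdot} x_i \geq 0 \\ -1 & \text{if } R_{j\cdot} x_i < 0 \end{cases}, \quad i = 0, \cdots, n-1, \; j = 0, \cdots, d-1. \tag{16}$$

For a fixed B. Define $\tilde{r}$ as the DFT of the circulant vector, $\tilde{r} := \mathcal{F}(r)$. Instead of solving for $r$ directly, we propose to solve for $\tilde{r}$, from which $r$ can be recovered by IDFT.

Key to our derivation is the fact that DFT projects the signal onto a set of orthogonal basis vectors. Therefore the $\ell_2$ norm can be preserved. Formally, according to Parseval's theorem, for any $t \in \mathbb{C}^d$ (Oppenheim et al., 1999),

$$\|t\|_2^2 = (1/d) \|\mathcal{F}(t)\|_2^2.$$

Denote $\operatorname{diag}(\cdot)$ as the diagonal matrix formed by a vector. Denote $\Re(\cdot)$ and $\Im(\cdot)$ as the real and imaginary parts, respectively. We use $B_{i\cdot}$ to denote the i-th row of $B$. With complex arithmetic, the first term in (15) can be expressed in the frequency domain as:

$$\begin{aligned} \|B - XR^T\|_F^2 &= \frac{1}{d} \sum_{i=0}^{n-1} \|\mathcal{F}(B_{i\cdot}^T - Rx_i)\|_2^2 \\ &= \frac{1}{d} \sum_{i=0}^{n-1} \|\mathcal{F}(B_{i\cdot}^T) - \tilde{r} \circ \mathcal{F}(x_i)\|_2^2 \\ &= \frac{1}{d} \sum_{i=0}^{n-1} \|\mathcal{F}(B_{i\cdot}^T) - \operatorname{diag}(\mathcal{F}(x_i))\tilde{r}\|_2^2 \\ &= \frac{1}{d} \sum_{i=0}^{n-1} \left(\mathcal{F}(B_{i\cdot}^T) - \operatorname{diag}(\mathcal{F}(x_i))\tilde{r}\right)^H \left(\mathcal{F}(B_{i\cdot}^T) - \operatorname{diag}(\mathcal{F}(x_i))\tilde{r}\right) \\ &= \frac{1}{d} \left[ \Re(\tilde{r})^T M \Re(\tilde{r}) + \Im(\tilde{r})^T M \Im(\tilde{r}) + \Re(\tilde{r})^T h + \Im(\tilde{r})^T g \right] + \|B\|_F^2, \end{aligned} \tag{17}$$

where

$$M = \operatorname{diag}\left( \sum_{i=0}^{n-1} \Re(\mathcal{F}(x_i)) \circ \Re(\mathcal{F}(x_i)) + \Im(\mathcal{F}(x_i)) \circ \Im(\mathcal{F}(x_i)) \right),$$

$$h = -2 \sum_{i=0}^{n-1} \Re(\mathcal{F}(x_i)) \circ \Re(\mathcal{F}(B_{i\cdot}^T)) + \Im(\mathcal{F}(x_i)) \circ \Im(\mathcal{F}(B_{i\cdot}^T)),$$

$$g = 2 \sum_{i=0}^{n-1} \Im(\mathcal{F}(x_i)) \circ \Re(\mathcal{F}(B_{i\cdot}^T)) - \Re(\mathcal{F}(x_i)) \circ \Im(\mathcal{F}(B_{i\cdot}^T)).$$

For the second term in (15), we note that the circulant matrix can be diagonalized by the DFT matrix $F_d$ and its conjugate transpose $F_d^H$. Formally, for $R = \operatorname{circ}(r)$, $r \in \mathbb{R}^d$,

$$R = (1/d) F_d^H \operatorname{diag}(\mathcal{F}(r)) F_d. \tag{18}$$

Let $\operatorname{Tr}(\cdot)$ be the trace of a matrix. Therefore,

$$\begin{aligned} \|RR^T - I\|_F^2 &= \left\| \frac{1}{d} F_d^H \left( \operatorname{diag}(\tilde{r})^H \operatorname{diag}(\tilde{r}) - I \right) F_d \right\|_F^2 \\ &= \operatorname{Tr}\left[ \frac{1}{d} F_d^H \left( \operatorname{diag}(\tilde{r})^H \operatorname{diag}(\tilde{r}) - I \right)^H \left( \operatorname{diag}(\tilde{r})^H \operatorname{diag}(\tilde{r}) - I \right) F_d \right] \\ &= \operatorname{Tr}\left[ \left( \operatorname{diag}(\tilde{r})^H \operatorname{diag}(\tilde{r}) - I \right)^H \left( \operatorname{diag}(\tilde{r})^H \operatorname{diag}(\tilde{r}) - I \right) \right] \\ &= \|\tilde{r}^H \circ \tilde{r} - \mathbf{1}\|_2^2 = \|\Re(\tilde{r})^2 + \Im(\tilde{r})^2 - \mathbf{1}\|_2^2. \end{aligned} \tag{19}$$

Furthermore, as $r$ is real-valued, additional constraints on $\tilde{r}$ are needed. For any $u \in \mathbb{C}$, denote $\bar{u}$ as the complex conjugate of $u$.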
We have the following result (Oppenheim et al., 1999): For any real-valued vector $t \in \mathbb{C}^d$, $\mathcal{F}(t)_0$ is real-valued, and

$$\mathcal{F}(t)_{d-i} = \overline{\mathcal{F}(t)_i}, \quad i = 1, \cdots, \lfloor d/2 \rfloor.$$

From (17) to (19), the problem of optimizing $\tilde{r}$ becomes

$$\begin{aligned} \operatorname*{argmin}_{\tilde{r}} \;\; & \Re(\tilde{r})^T M \Re(\tilde{r}) + \Im(\tilde{r})^T M \Im(\tilde{r}) + \Re(\tilde{r})^T h + \Im(\tilde{r})^T g + \lambda d \|\Re(\tilde{r})^2 + \Im(\tilde{r})^2 - \mathbf{1}\|_2^2 \\ \text{s.t.} \;\; & \Im(\tilde{r}_0) = 0, \\ & \Re(\tilde{r}_i) = \Re(\tilde{r}_{d-i}), \quad i = 1, \cdots, \lfloor d/2 \rfloor, \\ & \Im(\tilde{r}_i) = -\Im(\tilde{r}_{d-i}), \quad i = 1, \cdots, \lfloor d/2 \rfloor. \end{aligned} \tag{20}$$

The above is non-convex. Fortunately, the objective function can be decomposed, such that we can solve for two variables at a time. Denote the diagonal vector of the diagonal matrix $M$ as $m$. The above optimization can then be decomposed into the following sets of optimizations.

$$\operatorname*{argmin}_{\tilde{r}_0} \; m_0 \tilde{r}_0^2 + h_0 \tilde{r}_0 + \lambda d \left( \tilde{r}_0^2 - 1 \right)^2, \quad \text{s.t. } \tilde{r}_0 = \overline{\tilde{r}_0}. \tag{21}$$

$$\begin{aligned} \operatorname*{argmin}_{\tilde{r}_i} \;\; & (m_i + m_{d-i}) \left( \Re(\tilde{r}_i)^2 + \Im(\tilde{r}_i)^2 \right) + 2 \lambda d \left( \Re(\tilde{r}_i)^2 + \Im(\tilde{r}_i)^2 - 1 \right)^2 \\ & + (h_i + h_{d-i}) \Re(\tilde{r}_i) + (g_i - g_{d-i}) \Im(\tilde{r}_i), \quad i = 1, \cdots, \lfloor d/2 \rfloor. \end{aligned} \tag{22}$$

In (21), we need to minimize a 4th-order polynomial with one variable, with the closed form solution readily available. In (22), we need to minimize a 4th-order polynomial with two variables. Though the closed form solution is hard (requiring solution of a cubic bivariate system), we can find local minima by gradient descent, which can be considered as having constant running time for such small-scale problems. The overall objective is guaranteed to be non-increasing in each step. In practice, we can get a good solution with just 5-10 iterations. In summary, the proposed time-frequency alternating optimization procedure has running time $O(nd \log d)$.

4.2. Learning k < d Bits

In the case of learning $k < d$ bits, we need to solve the following optimization problem:

$$\operatorname*{argmin}_{B, r} \; \|BP_k - XP_k^T R^T\|_F^2 + \lambda \|RP_k P_k^T R^T - I\|_F^2 \quad \text{s.t. } R = \operatorname{circ}(r), \tag{23}$$

in which $P_k = \begin{bmatrix} I_k & O \\ O & O_{d-k} \end{bmatrix}$, $I_k$ is a $k \times k$ identity matrix, and $O_{d-k}$ is a $(d-k) \times (d-k)$ all-zero matrix.

In fact, the right multiplication of $P_k$ can be understood as a "temporal cut-off", which is equivalent to a frequency domain convolution.
This makes the optimization difficult, as the objective in the frequency domain can no longer be decomposed. To address this issue, we propose a simple solution in which $B_{ij} = 0$, $i = 0, \cdots, n-1$, $j = k, \cdots, d-1$ in (15). Thus, the optimization procedure remains the same, and the cost is also $O(nd \log d)$. We will show in experiments that this heuristic provides good performance in practice.

5. Experiments

To compare the performance of the proposed circulant binary embedding technique, we conducted experiments on three real-world high-dimensional datasets used by the current state-of-the-art method for generating long binary codes (Gong et al., 2013a). The Flickr-25600 dataset contains 100K images sampled from a noisy Internet image collection. Each image is represented by a 25,600-dimensional vector. The ImageNet-51200 dataset contains 100K images sampled from 100 random classes from ImageNet (Deng et al., 2009), each represented by a 51,200-dimensional vector. The third dataset (ImageNet-25600) is another random subset of ImageNet containing 100K images in 25,600-dimensional space. All the vectors are normalized to be of unit norm.

We compared the performance of the randomized (CBE-rand) and learned (CBE-opt) versions of our circulant embeddings with the current state-of-the-art for high-dimensional data, i.e., bilinear embeddings. We use both the randomized (bilinear-rand) and learned (bilinear-opt) versions. Bilinear embeddings have been shown to perform similarly to or better than another promising technique called Product Quantization (Jegou et al., 2011). Finally, we also compare against the binary codes produced by the baseline LSH method (Charikar, 2002), which is still applicable to 25,600- and 51,200-dimensional features but with much longer running time and much more space.
We also show an experiment with relatively low-dimensional data in 2048-dimensional space using Flickr data to compare against techniques that perform well for low-dimensional data but do not scale to the high-dimensional scenario. Exam-

C/C++ Thread Safety Analysis

DeLesley Hutchins, Google Inc. Email: delesley@google.com
Aaron Ballman, CERT/SEI. Email: aballman@cert.org
Dean Sutherland. Email: dfsuther@cs.cmu.edu

Abstract: Writing multithreaded programs is hard. Static analysis tools can help developers by allowing threading policies to be formally specified and mechanically checked. They essentially provide a static type system for threads, and can detect potential race conditions and deadlocks. This paper describes Clang Thread Safety Analysis, a tool which uses annotations to declare and enforce thread safety policies in C and C++ programs. Clang is a production-quality C++ compiler which is available on most platforms, and the analysis can be enabled for any build with a simple warning flag: -Wthread-safety. The analysis is deployed on a large scale at Google, where it has provided sufficient value in practice to drive widespread voluntary adoption. Contrary to popular belief, the need for annotations has not been a liability, and even confers some benefits with respect to software evolution and maintenance.

I. INTRODUCTION

Writing multithreaded programs is hard, because developers must consider the potential interactions between concurrently executing threads. Experience has shown that developers need help using concurrency correctly [1]. Many frameworks and libraries impose thread-related policies on their clients, but they often lack explicit documentation of those policies. Where such policies are clearly documented, that documentation frequently takes the form of explanatory prose rather than a checkable specification. Static analysis tools can help developers by allowing threading policies to be formally specified and mechanically checked.
Examples of threading policies are: "the mutex mu should always be locked before reading or writing variable accountBalance" and "the draw() method should only be invoked from the GUI thread."

Formal specification of policies provides two main benefits. First, the compiler can issue warnings on policy violations. Finding potential bugs at compile time is much less expensive in terms of engineer time than debugging failed unit tests, or worse, having subtle threading bugs hit production. Second, specifications serve as a form of machine-checked documentation. Such documentation is especially important for software libraries and APIs, because engineers need to know the threading policy to correctly use them. Although documentation can be put in comments, our experience shows that comments quickly "rot" because they are not updated when variables are renamed or code is refactored.

This paper describes thread safety analysis for Clang. The analysis was originally implemented in GCC [2], but the GCC version is no longer being maintained. Clang is a production-quality C++ compiler, which is available on most platforms, including MacOS, Linux, and Windows. The analysis is currently implemented as a compiler warning. It has been deployed on a large scale at Google; all C++ code at Google is now compiled with thread safety analysis enabled by default.

II. OVERVIEW OF THE ANALYSIS

Thread safety analysis works very much like a type system for multithreaded programs. It is based on theoretical work on race-free type systems [3]. In addition to declaring the type of data (int, float, etc.), the programmer may optionally declare how access to that data is controlled in a multithreaded environment.

Clang thread safety analysis uses annotations to declare threading policies. The annotations can be written using either GNU-style attributes (e.g., __attribute__((...))) or C++11-style attributes (e.g., [[...]]).
For portability, the attributes are typically hidden behind macros that are disabled when not compiling with Clang. Examples in this paper assume the use of macros; actual attribute names, along with a complete list of all attributes, can be found in the Clang documentation [4].

Figure 1 demonstrates the basic concepts behind the analysis, using the classic bank account example. The GUARDED_BY attribute declares that a thread must lock mu before it can read or write to balance, thus ensuring that the increment and decrement operations are atomic. Similarly, REQUIRES declares that the calling thread must lock mu before calling withdrawImpl. Because the caller is assumed to have locked mu, it is safe to modify balance within the body of the method.

In the example, the depositImpl() method lacks a REQUIRES clause, so the analysis issues a warning. Thread safety analysis is not interprocedural, so caller requirements must be explicitly declared. There is also a warning in transferFrom(), because it fails to lock b.mu even though it correctly locks this->mu. The analysis understands that these are two separate mutexes in two different objects. Finally, there is a warning in the withdraw() method, because it fails to unlock mu. Every lock must have a corresponding unlock; the analysis detects both double locks and double unlocks. A function may acquire a lock without releasing it (or vice versa), but it must be annotated to specify this behavior.

A. Running the Analysis

To run the analysis, first download and install Clang [5]. Then, compile with the -Wthread-safety flag:

    clang -c -Wthread-safety example.cpp

    #include "mutex.h"

    class BankAcct {
      Mutex mu;
      int   balance GUARDED_BY(mu);

      void depositImpl(int amount) {
        // WARNING! Must lock mu.
        balance += amount;
      }

      void withdrawImpl(int amount) REQUIRES(mu) {
        // OK. Caller must have locked mu.
        balance -= amount;
      }

    public:
      void withdraw(int amount) {
        mu.lock();
        // OK. We've locked mu.
        withdrawImpl(amount);
        // WARNING! Failed to unlock mu.
      }

      void transferFrom(BankAcct& b, int amount) {
        mu.lock();
        // WARNING! Must lock b.mu.
        b.withdrawImpl(amount);
        // OK. depositImpl() has no requirements.
        depositImpl(amount);
        mu.unlock();
      }
    };

Fig. 1. Thread Safety Annotations

Note that this example assumes the presence of a suitably annotated mutex.h [4] that declares which methods perform locking and unlocking.

B. Thread Roles

Thread safety analysis was originally designed to enforce locking policies such as the one previously described, but locks are not the only way to ensure safety. Another common pattern in many systems is to assign different roles to different threads, such as "worker thread" or "GUI thread" [6].

The same concepts used for mutexes and locking can also be used for thread roles, as shown in Figure 2. Here, a widget library has two threads, one to handle user input, like mouse clicks, and one to handle rendering. It also enforces a constraint: the draw() method should be invoked only by the GUI thread. The analysis will warn if draw() is invoked directly from onClick().

    #include "ThreadRole.h"

    ThreadRole Input_Thread;
    ThreadRole GUI_Thread;

    class Widget {
    public:
      virtual void onClick() REQUIRES(Input_Thread);
      virtual void draw()    REQUIRES(GUI_Thread);
    };

    class Button : public Widget {
    public:
      void onClick() override {
        depressed = true;
        draw();  // WARNING!
      }
    };

Fig. 2. Thread Roles

The rest of this paper will focus discussion on mutexes in the interest of brevity, but there are analogous examples for thread roles.

III. BASIC CONCEPTS

Clang thread safety analysis is based on a calculus of capabilities [7] [8]. To read or write to a particular location in memory, a thread must have the capability, or permission, to do so. A capability can be thought of as an unforgeable key, or token, which the thread must present to perform the read or write.

Capabilities can be either unique or shared. A unique capability cannot be copied, so only one thread can hold the capability at any one time. A shared capability may have multiple copies that are shared among multiple threads. Uniqueness is enforced by a linear type system [9].

The analysis enforces a single-writer/multiple-reader discipline. Writing to a guarded location requires a unique capability, and reading from a guarded location requires either a unique or a shared capability. In other words, many threads can read from a location at the same time because they can share the capability, but only one thread can write to it. Moreover, a thread cannot write to a memory location at the same time that another thread is reading from it, because a capability cannot be both shared and unique at the same time.

This discipline ensures that programs are free of data races, where a data race is defined as a situation that occurs when multiple threads attempt to access the same location in memory at the same time, and at least one of the accesses is a write [10]. Because write operations require a unique capability, no other thread can access the memory location at that time.

A. Background: Uniqueness and Linear Logic

Linear logic is a formal theory for reasoning about resources; it can be used to express logical statements like: "You cannot have your cake and eat it too" [9]. A unique, or linear, variable must be used exactly once; it cannot be duplicated (used multiple times) or forgotten (not used).

A unique object is produced at one point in the program, and then later consumed. Functions that use the object without consuming it must be written using a hand-off protocol. The caller hands the object to the function, thus relinquishing control of it; the function hands the object back to the caller when it returns.
For example, if std::stringstream were a linear type, stream programs would be written as follows:

    std::stringstream ss;            // produce ss
    auto& ss2 = ss << "Hello ";      // consume ss
    auto& ss3 = ss2 << "World.";     // consume ss2
    return ss3.str();                // consume ss3

Notice that each stream variable is used exactly once. A linear type system is unaware that ss and ss2 refer to the same stream; the calls to << conceptually consume one stream and produce another with a different name. Attempting to use ss a second time would be flagged as a use-after-consume error. Failure to call ss3.str() before returning would also be an error because ss3 would then be unused.

B. Naming of Capabilities

Passing unique capabilities explicitly, following the pattern described previously, would be needlessly tedious, because every read or write operation would introduce a new name. Instead, Clang thread safety analysis tracks capabilities as unnamed objects that are passed implicitly. The resulting type system is formally equivalent to linear logic but is easier to use in practical programming.

Each capability is associated with a named C++ object, which identifies the capability and provides operations to produce and consume it. The C++ object itself is not unique. For example, if mu is a mutex, mu.lock() produces a unique, but unnamed, capability of type Cap<mu> (a dependent type). Similarly, mu.unlock() consumes an implicit parameter of type Cap<mu>. Operations that read or write to data that is guarded by mu follow a hand-off protocol: they consume an implicit parameter of type Cap<mu> and produce an implicit result of type Cap<mu>.

C. Erasure Semantics

Because capabilities are implicit and are used only for typechecking purposes, they have no run time effect. As a result, capabilities can be fully erased from an annotated program, yielding an unannotated program with identical behavior.

In Clang, this erasure property is expressed in two ways.
First, recommended practice is to hide the annotations behind macros, where they can be literally erased by redefining the macros to be empty. However, literal erasure is unnecessary. The analysis is entirely static and is implemented as a compile time warning; it cannot affect Clang code generation in any way.

IV. THREAD SAFETY ANNOTATIONS

This section provides a brief overview of the main annotations that are supported by the analysis. The full list can be found in the Clang documentation [4].

GUARDED_BY(...) and PT_GUARDED_BY(...)

GUARDED_BY is an attribute on a data member; it declares that the data is protected by the given capability. Read operations on the data require at least a shared capability; write operations require a unique capability.

PT_GUARDED_BY is similar but is intended for use on pointers and smart pointers. There is no constraint on the data member itself; rather, the data it points to is protected by the given capability.

    Mutex mu;
    int *p2 PT_GUARDED_BY(mu);

    void test() {
      *p2 = 42;       // Warning!
      p2 = new int;   // OK (no GUARDED_BY).
    }

REQUIRES(...) and REQUIRES_SHARED(...)

REQUIRES is an attribute on functions; it declares that the calling thread must have unique possession of the given capability. More than one capability may be specified, and a function may have multiple REQUIRES attributes. REQUIRES_SHARED is similar, but the specified capabilities may be either shared or unique.

Formally, the REQUIRES clause states that a function takes the given capability as an implicit argument and hands it back to the caller when it returns, as an implicit result. Thus, the caller must hold the capability on entry to the function and will still hold it on exit.

    Mutex mu;
    int a GUARDED_BY(mu);

    void foo() REQUIRES(mu) {
      a = 0;   // OK.
    }

    void test() {
      foo();   // Warning! Requires mu.
    }

ACQUIRE(...) and RELEASE(...)
The ACQUIRE attribute annotates a function that produces a unique capability (or capabilities), for example, by acquiring it from some other thread. The caller must not hold the given capability on entry, and will hold the capability on exit.

RELEASE annotates a function that consumes a unique capability (e.g., by handing it off to some other thread). The caller must hold the given capability on entry, and will not hold it on exit.

ACQUIRE_SHARED and RELEASE_SHARED are similar, but produce and consume shared capabilities.

Formally, the ACQUIRE clause states that the function produces and returns a unique capability as an implicit result; RELEASE states that the function takes the capability as an implicit argument and consumes it. Attempts to acquire a capability that is already held or to release a capability that is not held are diagnosed with a compile time warning.

CAPABILITY(...)

The CAPABILITY attribute is placed on a struct, class or a typedef; it specifies that objects of that type can be used to identify a capability. For example, the threading libraries at Google define the Mutex class as follows:

    class CAPABILITY("mutex") Mutex {
    public:
      void lock()         ACQUIRE(this);
      void readerLock()   ACQUIRE_SHARED(this);
      void unlock()       RELEASE(this);
      void readerUnlock() RELEASE_SHARED(this);
    };

Mutexes are ordinary C++ objects. However, each mutex object has a capability associated with it; the lock() and unlock() methods acquire and release that capability.

Note that Clang thread safety analysis makes no attempt to verify the correctness of the underlying Mutex implementation. Rather, the annotations allow the interface of Mutex to be expressed in terms of capabilities. We assume that the underlying code implements that interface correctly, e.g., by ensuring that only one thread can hold the mutex at any one time.

TRY_ACQUIRE(b, ...) and TRY_ACQUIRE_SHARED(b, ...)
These are attributes on a function or method that attempts to acquire the given capability and returns a boolean value indicating success or failure. The argument b must be true or false, to specify which return value indicates success.

NO_THREAD_SAFETY_ANALYSIS

NO_THREAD_SAFETY_ANALYSIS is an attribute on functions that turns off thread safety checking for the annotated function. It provides a means for programmers to opt out of the analysis for functions that either (a) are deliberately thread-unsafe, or (b) are thread-safe, but too complicated for the analysis to understand.

Negative Requirements

All of the requirements described previously are positive requirements, where a function requires that certain capabilities be held on entry. However, the analysis can also track negative requirements, where a function requires that a capability be not-held on entry.

Positive requirements are used to prevent race conditions. Negative requirements are used to prevent deadlock. Many mutex implementations are not reentrant, because making them reentrant entails a significant performance cost. Attempting to acquire a non-reentrant mutex that is already held will deadlock the program.

To avoid deadlock, acquiring a capability requires a proof that the capability is not currently held. The analysis represents this proof as a negative capability, which is expressed using the ! negation operator:

    Mutex mu;
    int a GUARDED_BY(mu);

    void clear() REQUIRES(!mu) {
      mu.lock();
      a = 0;
      mu.unlock();
    }

    void reset() {
      mu.lock();
      // Warning! Caller cannot hold 'mu'.
      clear();
      mu.unlock();
    }

Negative capabilities are tracked in much the same way as positive capabilities, but there is a bit of extra subtlety. Positive requirements are typically confined within the class or the module in which they are declared.
For example, if a thread-safe class declares a private mutex, and does all locking and unlocking of that mutex internally, then there is no reason clients of the class need to know that the mutex exists. Negative requirements lack this property. If a class declares a private mutex mu, and locks mu internally, then clients should theoretically have to provide proof that they have not locked mu before calling any methods of the class. Moreover, there is no way for a client function to prove that it does not hold mu, except by adding REQUIRES(!mu) to the function definition. As a result, negative requirements tend to propagate throughout the code base, which breaks encapsulation. To avoid such propagation, the analysis restricts the visibility of negative capabilities. The analyzer assumes that it holds a negative capability for any object that is not defined within the current lexical scope. The scope of a class member is assumed to be its enclosing class, while the scope of a global variable is the translation unit in which it is defined. Unfortunately, this visibility-based assumption is unsound. For example, a class with a private mutex may lock the mutex, and then call some external routine, which calls a method in the original class that attempts to lock the mutex a second time. The analysis will generate a false negative in this case. Based on our experience in deploying thread safety analysis at Google, we believe this to be a minor problem. It is relatively easy to avoid this situation by following good software design principles and maintaining proper separation of concerns. Moreover, when compiled in debug mode, the Google mutex implementation does a run time check to see if the mutex is already held, so this particular error can be caught by unit tests at run time. V. IMPLEMENTATION The Clang C++ compiler provides a sophisticated infrastructure for implementing warnings and static analysis. 
Clang initially parses a C++ input file to an abstract syntax tree (AST), which is an accurate representation of the original source code, down to the location of parentheses. In contrast, many compilers, including GCC, lower to an intermediate language during parsing. The accuracy of the AST makes it easier to emit quality diagnostics, but complicates the analysis in other respects. The Clang semantic analyzer (Sema) decorates the AST with semantic information. Name lookup, function overloading, operator overloading, template instantiation, and type checking are all performed by Sema when constructing the AST. Clang inserts special AST nodes for implicit C++ operations, such as automatic casts, LValue-to-RValue conversions, implicit destructor calls, and so on, so the AST provides an accurate model of C++ program semantics. Finally, the Clang analysis infrastructure constructs a control flow graph (CFG) for each function in the AST. This is not a lowering step; each statement in the CFG points back to the AST node that created it. The CFG is shared infrastructure; the thread safety analysis is only one of its many clients.

A. Analysis Algorithm

The thread safety analysis algorithm is flow-sensitive, but not path-sensitive. It starts by performing a topological sort of the CFG, and identifying back edges. It then walks the CFG in topological order, and computes the set of capabilities that are known to be held, or known not to be held, at every program point. When the analyzer encounters a call to a function that is annotated with ACQUIRE, it adds a capability to the set; when it encounters a call to a function that is annotated with RELEASE, it removes it from the set. Similarly, it looks for REQUIRES attributes on function calls, and GUARDED_BY on loads or stores to variables. It checks that the appropriate capability is in the current set, and issues a warning if it is not.
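The bookkeeping described above can be sketched in a few lines. The following Python model is only an illustration, not Clang's implementation: the `Op` record, the operation kinds, and `check_block` are hypothetical names, and the sketch handles a single straight-line sequence of operations rather than a full CFG.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One abstract operation (hypothetical model, not a Clang data structure)."""
    kind: str        # "acquire", "release", "requires", or "access"
    capability: str  # name of the capability involved, e.g. "mu"


def check_block(ops, held=frozenset()):
    """Track the held-capability set through straight-line code.

    Returns the set held on exit and a list of (index, message) warnings.
    """
    held = set(held)
    warnings = []
    for index, op in enumerate(ops):
        if op.kind == "acquire":
            # Acquiring a capability that is already held is diagnosed.
            if op.capability in held:
                warnings.append((index, f"'{op.capability}' is already held"))
            held.add(op.capability)
        elif op.kind == "release":
            # Releasing a capability that is not held is diagnosed.
            if op.capability not in held:
                warnings.append((index, f"'{op.capability}' is not held"))
            held.discard(op.capability)
        elif op.kind in ("requires", "access"):
            # REQUIRES on a call and GUARDED_BY on a load or store both
            # demand that the capability is in the current set.
            if op.capability not in held:
                warnings.append((index, f"'{op.capability}' must be held"))
        else:
            raise ValueError(f"unknown operation kind: {op.kind}")
    return frozenset(held), warnings
```

For the sequence acquire, access, release, access on a single capability, this sketch reports only the final access, because the capability has already been removed from the set at that point.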
When the analyzer encounters a join point in the CFG, it checks to confirm that every predecessor basic block has the same set of capabilities on exit. Back edges are handled similarly: a loop must have the same set of capabilities on entry to and exit from the loop. Because the analysis is not path-sensitive, it cannot handle control-flow situations in which a mutex might or might not be held, depending on which branch was taken. For example:

    void foo() {
      if (b) mutex.lock();
      // Warning: mutex may or may not be held here.
      doSomething();
      if (b) mutex.unlock();
    }

    void lockAll() {
      // Warning: capability sets do not match
      // at start and end of loop.
      for (unsigned i = 0; i < n; ++i)
        mutexArray[i].lock();
    }

Although this seems like a serious limitation, we have found that conditionally held locks are relatively unimportant in practical programming. Reading or writing to a guarded location in memory requires that the mutex be held unconditionally, so attempting to track locks that might be held has little benefit in practice, and usually indicates overly complex or poor-quality code. Requiring that capability sets be the same at join points also speeds up the algorithm considerably. The analyzer need not iterate to a fixpoint; thus it traverses every statement in the program exactly once. Consequently, the computational complexity of the analysis is O(n) with respect to code size. The compile time overhead of the warning is minimal.

B. Intermediate Representation

Each capability is associated with a C++ object. C++ objects are run time entities that are identified by C++ expressions. The same object may be identified by different expressions in different scopes. For example:

    class Foo {
      Mutex mu;
      bool compare(const Foo& other) REQUIRES(this->mu, other.mu);
    };

    void bar() {
      Foo a;
      Foo *b;
      ...
      a.mu.lock();
      b->mu.lock();
      // REQUIRES (&a)->mu, (*b).mu
      a.compare(*b);
      ...
    }

Clang thread safety analysis is dependently typed: note that the REQUIRES clause depends on both this and other, which are parameters to compare. The analyzer performs variable substitution to obtain the appropriate expressions within bar(); it substitutes &a for this and *b for other. Recall, however, that the Clang AST does not lower C++ expressions to an intermediate language; rather, it stores them in a format that accurately represents the original source code. Consequently, (&a)->mu and a.mu are different expressions. A dependent type system must be able to compare expressions for semantic (not syntactic) equality. The analyzer implements a simple compiler intermediate representation (IR), and lowers Clang expressions to the IR for comparison. It also converts the Clang CFG into static single assignment (SSA) form so that the analyzer will not be confused by local variables that take on different values in different places.

C. Limitations

Clang thread safety analysis has a number of limitations. The three major ones are:

No attributes on types. Thread safety attributes are attached to declarations rather than types. For example, it is not possible to write vector<int GUARDED_BY(mu)>, or (int GUARDED_BY(mu))[10]. If attributes could be attached to types, PT_GUARDED_BY would be unnecessary. Attaching attributes to types would result in a better and more accurate analysis. However, it was deemed infeasible for C++ because it would require invasive changes to the C++ type system that could potentially affect core C++ semantics in subtle ways, such as template instantiation and function overloading.

No dependent type parameters. Race-free type systems as described in the literature often allow classes to be parameterized by the objects that are responsible for controlling access [3], [11]. For example, assume a Graph class has a list of nodes, and a single mutex that protects all of them.
In this case, the Node class should technically be parameterized by the graph object that guards it (similar to inner classes in Java), but that relationship cannot be easily expressed with attributes.

No alias analysis. C++ programs typically make heavy use of pointer aliasing; we currently lack an alias analysis. This can occasionally cause false positives, such as when a program locks a mutex using one alias, but the GUARDED_BY attribute refers to the same mutex using a different alias.

VI. EXPERIMENTAL RESULTS AND CONCLUSION

Clang thread safety analysis is currently deployed on a wide scale at Google. The analysis is turned on by default, across the company, for every C++ build. Over 20,000 C++ files are currently annotated, with more than 140,000 annotations, and those numbers are increasing every day. The annotated code spans a wide range of projects, including many of Google’s core services. Use of the annotations at Google is entirely voluntary, so the high level of adoption suggests that engineering teams at Google have found the annotations to be useful.

Because race conditions are insidious, Google uses both static analysis and dynamic analysis tools such as Thread Sanitizer [12]. We have found that these tools complement each other. Dynamic analysis operates without annotations and thus can be applied more widely. However, dynamic analysis can only detect race conditions in the subset of program executions that occur in test code. As a result, effective dynamic analysis requires good test coverage, and cannot report potential bugs until test time. Static analysis is less flexible, but covers all possible program executions; it also reports errors earlier, at compile time.

Although the need for handwritten annotations may appear to be a disadvantage, we have found that the annotations confer significant benefits with respect to software evolution and maintenance. Thread safety annotations are widely used in Google’s core libraries and APIs.
Annotating libraries has proven to be particularly important, because the annotations serve as a form of machine-checked documentation. The developers of a library and the clients of that library are usually different engineering teams. As a result, the client teams often do not fully understand the locking protocol employed by the library. Other documentation is usually out of date or nonexistent, so it is easy to make mistakes. By using annotations, the locking protocol becomes part of the published API, and the compiler will warn about incorrect usage. Annotations have also proven useful for enforcing internal design constraints as software evolves over time. For example, the initial design of a thread-safe class must establish certain constraints: locks are used in a particular way to protect private data. Over time, however, that class will be read and modified by many different engineers. Not only may the initial constraints be forgotten, they may change when code is refactored. When examining change logs, we found several cases in which an engineer added a new method to a class, forgot to acquire the appropriate locks, and consequently had to debug the resulting race condition by hand. When the constraints are explicitly specified with annotations, the compiler can prevent such bugs by mechanically checking new code for consistency with existing annotations. The use of annotations does entail costs beyond the effort required to write the annotations. In particular, we have found that about 50% of the warnings produced by the analysis are caused not by incorrect code but rather by incorrect or missing annotations, such as failure to put a REQUIRES attribute on getter and setter methods. Thread safety annotations are roughly analogous to the C++ const qualifier in this regard. Whether such warnings are false positives depends on your point of view. 
Google’s philosophy is that incorrect annotations are “bugs in the documentation.” Because APIs are read many times by many engineers, it is important that the public interfaces be accurately documented. Excluding cases in which the annotations were clearly wrong, the false positive rate is otherwise quite low: less than 5%. Most false positives are caused by either (a) pointer aliasing, (b) conditionally acquired mutexes, or (c) initialization code that does not need to acquire a mutex. Conclusion Type systems for thread safety have previously been implemented for other languages, most notably Java [3] [11]. Clang thread safety analysis brings the benefit of such systems to C++. The analysis has been implemented in a production C++ compiler, tested in a production environment, and adopted internally by one of the world’s largest software companies. REFERENCES [1] K. Asanovic et al., “A view of the parallel computing landscape,” Communications of the ACM, vol. 52, no. 10, 2009. [2] L.-C. Wu, “C/C++ thread safety annotations,” 2008. [Online]. Available: https://docs.google.com/a/google.com/document/d/1 d9MvYX3LpjTk 3nlubM5LE4dFmU91SDabVdWp9-VDxc [3] M. Abadi, C. Flanagan, and S. N. Freund, “Types for safe locking: Static race detection for java,” ACM Transactions on Programming Languages and Systems, vol. 28, no. 2, 2006. [4] “Clang thread safety analysis documentation.” [Online]. Available: http://clang.llvm.org/docs/ThreadSafetyAnalysis.html [5] “Clang: A c-language front-end for llvm.” [Online]. Available: http://clang.llvm.org [6] D. F. Sutherland and W. L. Scherlis, “Composable thread coloring,” PPoPP ’10: Proceedings of the ACM Symposium on Principles and Practice of Parallel Programming, 2010. [7] K. Crary, D. Walker, and G. Morrisett, “Typed memory management in a calculus of capabilities,” Proceedings of POPL, 1999. [8] J. Boyland, J. Noble, and W. Retert, “Capabilities for sharing,” Proceedings of ECOOP, 2001. [9] J.-Y. 
Girard, “Linear logic,” Theoretical Computer Science, vol. 50, no. 1, pp. 1–101, 1987. [10] S. Savage, M. Burrows, G. Nelson, P. Sobalvarro, and T. Anderson, “Eraser: A dynamic data race detector for multithreaded programs,” ACM Transactions on Computer Systems (TOCS), vol. 15, no. 4, 1997. [11] C. Boyapati and M. Rinard, “A parameterized type system for race-free Java programs,” Proceedings of OOPSLA, 2001. [12] K. Serebryany and T. Iskhodzhanov, “ThreadSanitizer: data race detection in practice,” Workshop on Binary Instrumentation and Applications, 2009.

An Optimized Template Matching Approach to Intra Coding in Video/Image Compression

Hui Su, Jingning Han, and Yaowu Xu
Chrome Media, Google Inc., 1950 Charleston Road, Mountain View, CA 94043

ABSTRACT

Template matching prediction is an established approach to intra-frame coding that uses previously coded pixels in the same frame for reference. It compares the previously reconstructed upper and left boundaries of a block against those of candidates in the reference area to find the best-matched block for prediction, and hence eliminates the need to send additional information for the decoder to reproduce the same prediction. Viewing the image signal as an auto-regressive model, this work is premised on the observation that pixels closer to the known block boundary are better predicted than those farther away. It significantly extends the scope of the template matching approach, which is typically followed by a conventional discrete cosine transform (DCT) for the prediction residuals, by employing an asymmetric discrete sine transform (ADST), whose basis functions vanish at the prediction boundary and reach maximum magnitude at the far end, to fully exploit the statistics of the residual signals. Experiments show that the proposed scheme provides substantial coding performance gains over both the conventional template matching method and the baseline.
Keywords: Template matching, Intra prediction, Transform coding, Asymmetric discrete sine transform

1. INTRODUCTION

Intra-frame coding is a key component of a video/image compression system. It predicts from previously reconstructed neighboring pixels to largely remove spatial redundancies. A codec typically allows various prediction directions [1–3], and the encoder selects the one that best describes the texture patterns (and hence yields the minimal rate-distortion cost) for block coding. Such boundary-extrapolation-based prediction is efficient when the image signals are well modeled by a first-order Markovian process. In practice, however, image signals might contain complicated patterns that appear repeatedly, which the boundary prediction approach cannot effectively capture. This motivated the initial block matching prediction, which searches the previously reconstructed frame area for a reference, as an additional mode [4]. A displacement vector per block is hence needed to inform the decoder how to reproduce the prediction, akin to the motion vector for inter-frame motion compensation. To overcome this overhead cost, which diminishes the performance gains, a template matching prediction (TMP) approach was developed [5] that employs the available neighboring pixels of a block as a template, measures the template similarity between the block of interest and the candidate references, and chooses the most “similar” one as the prediction. Clearly the decoder is able to repeat the same process without recourse to further information, which further allows the TMP to operate at smaller block sizes for more precise prediction at no additional cost. A conventional 2D-DCT is then applied to the prediction residuals, followed by quantization and entropy coding, to encode the block. Certain coding performance gains were obtained by integrating the TMP in a regular intra coder.
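The template search described above can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the codec's implementation: the frame is a list of rows of integers, the template is a single row above and a single column to the left of the block, similarity is the plain sum of absolute differences (SAD), and a candidate is treated as reconstructed only if it lies entirely above the target block's rows or entirely to its left. The function names are hypothetical.

```python
def template_sad(frame, r, c, r0, c0, size):
    """SAD between the template at (r, c) and the target template at (r0, c0)."""
    top = sum(abs(frame[r - 1][c + k] - frame[r0 - 1][c0 + k]) for k in range(size))
    left = sum(abs(frame[r + k][c - 1] - frame[r0 + k][c0 - 1]) for k in range(size))
    return top + left


def predict_block(frame, r0, c0, size):
    """Return (prediction block, candidate position, SAD) for the target block.

    Only already-reconstructed pixels are read, so a decoder could repeat
    the same search without any side information.
    """
    if r0 < 1 or c0 < 1:
        raise ValueError("the target block needs a row above and a column to the left")
    width = len(frame[0])
    best = None
    for r in range(1, r0 + 1):
        for c in range(1, width - size + 1):
            fully_above = r + size <= r0   # candidate ends above the target rows
            fully_left = c + size <= c0    # candidate ends left of the target columns
            if not (fully_above or fully_left):
                continue                   # would overlap pixels not yet coded
            sad = template_sad(frame, r, c, r0, c0, size)
            if best is None or sad < best[0]:
                best = (sad, r, c)
    if best is None:
        raise ValueError("no valid candidate in the reconstructed area")
    sad, r, c = best
    block = [row[c:c + size] for row in frame[r:r + size]]
    return block, (r, c), sad
```

On a frame with a repeating texture, the candidate whose template matches exactly also reproduces the target block, which is the situation in which TMP is expected to help most.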
In viewing the image signals as an auto-regressive process, pixels close to the block boundaries are more correlated with the template pixels, and hence are better predicted by the matched reference, than those sitting at the far end. Therefore, the residuals ought to exhibit smaller variance at the known boundaries and gradually increasing energy toward the opposite end, which makes the efficacy of the DCT questionable, because its basis functions reach maximal magnitude at both ends and are agnostic to the statistical characteristics of the residuals. This work addresses this issue by incorporating the ADST [6, 7], whose basis functions possess the desired asymmetric properties, as an alternative transform for the TMP residuals for optimal coding performance. A complementary similarity measurement based on weighted template matching, in recognition of the statistical variations across the block, was also proposed to improve the search quality. The scheme was implemented in the VP9 framework, in conjunction with other boundary prediction based intra coding modes. Experiments demonstrated remarkable performance advantages over conventional TMP as well as the baseline codec.

[Figure 1. Template matching intra prediction. Labels: reconstructed pixels, to-be-coded pixels, target block, prediction block, target template, candidate template.]

E-mails: {huisu, jingning, yaowu}@google.com. Visual Information Processing and Communication V, edited by Amir Said, Onur G. Guleryuz, Robert L. Stevenson, Proc. of SPIE-IS&T Electronic Imaging, SPIE Vol. 9029, 902904 © 2014 SPIE-IS&T · CCC code: 0277-786X/14/$18 · doi: 10.1117/12.2040890. Downloaded from http://proceedings.spiedigitallibrary.org/ on 01/05/2015. Terms of use: http://spiedl.org/terms

The rest of the paper is organized as follows: Sec. 2 presents a brief review of the template matching approach. In Sec. 3, we describe the proposed techniques in detail. Experimental results are presented in Sec.
4, and Sec. 5 concludes the paper.

2. REVISITING TEMPLATE MATCHING PREDICTION

We provide a brief review of the TMP approach [5] in this section. As shown in Fig. 1, the TMP employs the pixels in the adjacent upper rows and left columns of a block as its template. Every template in the reconstructed area of the frame is considered as a reference template, and the template of the block to be encoded is the target template. The similarity between the target template and the reference templates is then evaluated in terms of the sum of absolute or squared differences. The encoder selects amongst the reference templates the one that best resembles the target template as the candidate template, and the block corresponding to this candidate template is used as the prediction for the target block. Since it only involves comparing reconstructed pixels, the same operations can be repeated at the decoder side without any additional side information being sent, resulting in higher compression efficiency than the direct block matching approach [4]. As a consequence, the decoding process becomes more computationally demanding. The TMP approach was shown to be particularly efficient in scenarios where certain complicated texture patterns that cannot be captured by the conventional directional intra prediction modes appear repeatedly in the image/frame. Recent research efforts have been devoted to further improving the TMP scheme, including combining multiple candidates with top similarity scores [8], using hybrid TMP and block matching (with the displacement vector sent explicitly) [9], etc. This work is focused on optimizing the original TMP approach by observing and exploiting the statistical properties of the TMP residual signals. It is noteworthy that the proposed principles are generally applicable to other advanced variants as well.

3. PROPOSED TECHNIQUES

We view the image signals as an auto-regressive model, which implies that two nearby pixels are more correlated than those far apart.
Since the template of a matched reference block closely resembles that of the block of interest, the pixels sitting close to the known boundaries of the two blocks are element-wise more correlated than those at the opposite end. Hence the pixels near the top/left boundaries are better predicted by the matched reference block, which translates into a key observation: the variance of the prediction residuals tends to vanish at the known boundaries and gradually increases toward the far end. This suggests that, unlike the discrete cosine transform (DCT), whose basis functions reach maximum magnitude at both ends, the (near) optimal spatial transform for the TMP residuals should possess such asymmetric properties. We hence propose to employ the asymmetric discrete sine transform (ADST) [6, 7] for transform coding of the TMP residuals. A complementary matching approach that expands the template to multiple boundary rows and columns, and uses a weighted sum of difference measurement, is first developed for more precise referencing. A statistical study of the TMP residuals, followed by a detailed discussion of the ADST, is provided next.

[Figure 2. The proposed weighted template matching scheme. The pixels that are closer to the target block are assigned a larger weight. Labels: target block pixels, template pixels with larger weight, template pixels with smaller weight.]

3.1 Weighted Template Matching

In order to obtain reliable template matching, it is reasonable to define multiple layers of boundary pixels as the template of a block. In our study, we have observed that the prediction accuracy can be improved when the number of rows and columns in the template increases. However, it is not wise to adopt too many layers, as the gain in matching accuracy saturates and the computational complexity explodes.
In our implementation, the template consists of the pixels in the 2 rows and 2 columns above and to the left of the given block, which gives a good tradeoff between accuracy and computational complexity. The similarity between the target templates and reference templates can be measured by the sum of absolute difference (SAD). Along the line of recognizing the variations in statistics, the template pixels closer to the block are highly correlated to the block content, and hence should be weighted more heavily in the SAD calculation than the distant ones. This idea is illustrated in Fig. 2. A weight ratio of 3:2 for the inner row/column versus the outer row/column is used in this work.

3.2 Spatial Transformation

In video/image compression, the prediction residuals are typically processed via transformation to further remove the remaining spatial redundancy, before the quantization and entropy coding modules. The Karhunen-Loeve transform (KLT) is considered the optimal spatial transform in terms of energy compaction. However, the KLT is rarely used in practical coding systems due to its high computational complexity. The DCT has long been a popular substitute due to its good tradeoff between energy compaction and complexity. The basis functions of the DCT are as follows:

\[ [T_C]_{j,i} = \alpha \cos\left( \frac{\pi (j - 1)(2i - 1)}{2N} \right), \tag{1} \]

where N is the block size, i, j ∈ {1, 2, ..., N} denote the space and frequency indexes, respectively; and

\[ \alpha = \begin{cases} \sqrt{1/N}, & \text{if } j = 1, \\ \sqrt{2/N}, & \text{otherwise.} \end{cases} \]

It is easy to see that the basis functions of the DCT achieve their maximum energy at both ends (i.e., i = 1 or i = N).
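As a numerical companion to Eq. (1), the sketch below builds the N × N DCT matrix directly from the formula and provides a helper for checking orthonormality. It is an illustration only: `dct_matrix` and `gram` are hypothetical helper names, and the code uses 0-based list indices for the 1-based i and j of the text.

```python
import math


def dct_matrix(n):
    """Rows are the DCT basis functions [T_C]_{j,i} of Eq. (1), j = 1..n."""
    rows = []
    for j in range(1, n + 1):
        alpha = math.sqrt(1.0 / n) if j == 1 else math.sqrt(2.0 / n)
        rows.append([
            alpha * math.cos(math.pi * (j - 1) * (2 * i - 1) / (2 * n))
            for i in range(1, n + 1)
        ])
    return rows


def gram(matrix):
    """Inner products between all pairs of rows (identity if orthonormal)."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in matrix] for r1 in matrix]
```

Each basis function has the same magnitude at i = 1 and i = N, since the cosine argument at i = N equals π(j − 1) minus the argument at i = 1. The DCT therefore treats the known boundary and the far end of the block alike, which is the property the text contrasts with the residual statistics.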
Assuming the template of a matched reference block closely approximates that of the block of interest, it is highly likely that pixels close to these known boundaries are also well predicted, while those distant pixels are less correlated, which results in a relatively higher residual variance. This postulation is verified by the following experimental study. We collected the absolute values of the TMP prediction residues element-wise over 8000 blocks (of dimension 4 × 4) from the foreman sequence, and the average of the residue signal at each pixel location was calculated, as shown below:

\[ \begin{pmatrix} 4.05 & 4.37 & 4.51 & 5.13 \\ 4.72 & 5.48 & 5.50 & 6.60 \\ 5.04 & 5.95 & 6.12 & 7.32 \\ 5.50 & 6.28 & 6.97 & 8.20 \end{pmatrix} \]

As can be seen from the matrix, the variance of the prediction residue signal indeed increases along both the horizontal and vertical directions. As mentioned above, the basis functions of the conventional DCT achieve their maximum energy at both ends and are therefore agnostic to the statistical patterns of the prediction residuals. As an alternative, the ADST [6, 7] has basis functions of the form:

\[ [T_S]_{j,i} = \frac{2}{\sqrt{2N + 1}} \sin\left( \frac{(2j - 1)\, i\, \pi}{2N + 1} \right), \tag{2} \]

where N is the block size, i, j ∈ {1, 2, ..., N} denote the space and frequency indexes, respectively. It is shown [6, 7] that the ADST is a better approximation of the optimal KLT than the DCT when the partial boundary information is available. Clearly, the basis functions of the ADST vanish at the known prediction boundary (i = 1) and are maximized at the far end (i = N), and therefore match well with the statistical patterns of the TMP residuals. We hence propose to employ the ADST as the spatial transform for the TMP residuals. It is experimentally shown in the next section that the use of the ADST provides substantial performance improvement over the TMP followed by the conventional DCT.

4.
EXPERIMENT RESULTS

The proposed scheme was tested in the VP9 framework [1]. We verified its efficacy in a relatively simplified setting, where the block size was fixed as 8 × 8. There are 10 intra prediction modes in VP9, including vertical prediction, horizontal prediction, 8 angular prediction modes, and a “true motion” mode that utilizes the left, above and corner pixels simultaneously. The TMP scheme was implemented as an additional mode to the 10 existing ones. The selection among the 11 modes is based on rate-distortion optimization. In the TMP mode, the 8 × 8 block is further partitioned into four 4 × 4 blocks, each of which is predicted via template matching, followed by the 2D-ADST transform, quantization, and reconstruction, in a raster scan order. The template consists of pixels from 2 rows and 2 columns above and to the left of the given block. For the weighted template matching, we use a weight ratio 3:2 for the inner row/column versus the outer row/column, as shown in Fig. 2.

[Figure 3. Rate-Distortion curves of the Ice (upper) and Foreman (lower) test sequences. Axes: bits per frame (horizontal) versus PSNR (vertical). Curves: VP9 Baseline, Template Matching, Proposed Scheme.]

Several test video clips were used to compare the coding efficiency, including the Ice, Foreman, and Carphone sequences. For every test sequence, the first 75 frames were coded as key-frames (i.e., all blocks were coded in intra modes), at various bit-rates. The coding performance gains of the conventional TMP and the proposed method over the reference codec, measured by the Bjontegaard metric, are shown in Table 1.
Clearly, the proposed approach that optimizes the transformation for the prediction residual significantly improves the performance of TMP, and both outperform the reference VP9 baseline. The rate-distortion curves of the Ice and Foreman sequences are also provided in Fig. 3. It can be seen from the figure that the proposed techniques boost the coding efficiency of the conventional TMP consistently.

Table 1. Coding performance gains over VP9 baseline in terms of bit-rate reduction percentage.

    Sequences    Conventional TMP    Proposed Method
    Ice          2.89                3.78
    Foreman      2.88                3.33
    Carphone     1.05                1.35

5. CONCLUSIONS AND FUTURE WORK

This work proposed a novel approach that incorporated the ADST for TMP prediction residuals as an additional mode for intra-frame coding. A complementary template matching method along the lines of recognizing the statistical variations across the block was also provided for more precise reference search. The scheme implemented in the VP9 framework demonstrated substantial performance improvements over the conventional TMP as well as the reference codec.

The TMP approach can also be applied to inter-frame prediction [10]. The template of a block is defined as the pixels in the adjacent upper rows and left columns, in the same way as in the case of intra prediction. The optimal template, which is best matched to that of the block of interest, is found in a previously encoded reference frame, and the block to be encoded is filled in by copying the block corresponding to the optimal template. By the same principles as in this work, the residue signal of the template matching inter prediction should also present asymmetric statistical properties across the block.
We thus expect the ADST to be more efficient than the conventional DCT for the transform coding of the template matching inter prediction, and are currently working along this direction. REFERENCES [1] VP9 Video Codec , http://www.webmproject.org/vp9/. [2] Wiegand, T., Sullivan, G. J., Bjontegaard, G., and Luthra, A., “Overview of the H.264/avc video coding standard,” IEEE Trans. Circuits and Systems for Video Technology 13, 560–576 (July 2003). [3] Sullivan, G., J. Ohm, W. H., and Wiegand, T., “Overview of the high effciency video coding (HEVC) standard,” IEEE Trans. Circuits and Systems for Video Technology 22, 1649–1668 (Dec. 2012). [4] Yu, S. and Chrysafis, C., “New intra prediction using intra-macroblock motion compensation,” Tech. Rep. JVT-C151 (2002). [5] Tan, T., Boon, C., and Suzuki, Y., “Intra prediction by template matching,” IEEE Proc. ICIP , 1693–1696 (2006). [6] Han, J., Saxena, A., and Rose, K., “Towards jointly optimal spatial prediction and adaptive transform in video/image coding,” IEEE Proc. ICASSP , 726–729 (2010). [7] Han, J., Saxena, A., Melkote, V., and Rose, K., “Jointly optimized spatial prediction and block transform for video and image coding,” IEEE Trans. on Image Processing 21, 1874–1884 (2012). [8] Tan, T., Boon, C., and Suzuki, Y., “Intra prediction by averaged template matching predictors,” IEEE Proc. CCNC (2007). [9] Cherigui, S., Thoreau, D., Guillotel, P., and Perez, P., “Hybrid template and block matching algorithm for image intra prediction,” IEEE Proc. ICASSP , 781–784 (2012). [10] Sugimoto, K., Kobayashi, M., Suzuki, Y., Kato, S., and Boon, C. S., “Inter frame coding with template matching spatio-temporal prediction,” IEEE Proc. ICIP (2004). SPIE-IS&T/ Vol. 9029 902904-6 Downloaded From: http://proceedings.spiedigitallibrary.org/ on 01/05/2015 Terms of Use: http://spiedl.org/terms How Many People Visit YouTube? Imputing Missing Events in Panels With Excess Zeros Georg M. 
Goerg, Yuxue Jin, Nicolas Remy, Jim Koehler
Google, Inc., United States
E-mail for correspondence: gmg@google.com

Abstract: Media-metering panels track TV and online usage of people to analyze viewing behavior. However, panel data is often incomplete due to non-registered devices, non-compliant panelists, or work usage. We thus propose a probabilistic model to impute missing events in data with excess zeros, using a negative-binomial hurdle model for the unobserved events and beta-binomial sub-sampling to account for missingness. We then use the presented models to estimate the number of people in Germany who visit YouTube.

Keywords: imputation; missing data; zero inflation; panel data.

1 Introduction

Media panels (GfK Consumer Panels, 2013) are used by advertisers to estimate reach and frequency of a campaign: reach is the fraction of the population that has seen an ad, and frequency tells us how often they have seen it (on average). It is important to get good estimates from panel data, as they largely determine the cost of an ad spot on TV or a website.

Naïvely, one would use a sample fraction of the number of non-zero events (website visits, TV spots watched, etc.) per unit time to estimate reach; similarly for frequency. This, however, suffers from underestimation, as panels often record only a fraction of all events due to, e.g., non-compliance or work usage. Correcting this bias and imputing missing events has been studied previously (Fader and Hardie, 2000; Yang et al., 2010).

In this work we i) extend the beta-binomial negative-binomial (BBNB) model (Hofler and Scrogin, 2008) with a hurdle component to improve modeling of excess zeros in panel data (§2); ii) present the maximum likelihood estimator (MLE) and also add prior information on missingness (§3); and iii) use the methodology to estimate, from online media panels and internal YouTube log files, how many people in Germany visit YouTube (§4).
The proposed methodology can be applied to a great variety of situations where events have been counted but some are known to be missing.

2 Hierarchical Event Imputation

Let $N_i \in \{0, 1, 2, \ldots\}$ count the true (but unobserved) number of visits by panelist $i$. The population consists of people who do not visit YouTube at all (with probability $q_0 \in [0, 1]$), and those who visit at least once. If a panelist visits (overcoming the “hurdle” with probability $1 - q_0$), we assume that $N_i$ is distributed according to a shifted Poisson distribution (starting at $n = 1$) with rate $\lambda_i$. To model heterogeneity among the population we use a $\mathrm{Gamma}\left(r, \frac{q_1}{1 - q_1}\right)$ prior for $\lambda_i$, with $r > 0$ and $q_1 \in (0, 1)$. Overall, this yields a shifted negative binomial hurdle (NBH) distribution

$$P(N = n; q_0, q_1, r) = \begin{cases} q_0, & \text{if } n = 0, \\ (1 - q_0) \cdot \dfrac{\Gamma(n + r - 1)}{\Gamma(r)\,\Gamma(n)} \cdot (1 - q_1)^{r}\, q_1^{\,n-1}, & \text{if } n \geq 1. \end{cases} \tag{1}$$

We choose a hurdle, rather than a mixture, model for the excess zeros (Hu et al., 2011), since $1 - q_0$ can be directly interpreted as the true, but unobserved, 1+ reach: if an advertiser shows an ad on YouTube they can expect that a fraction $1 - q_0$ of the population sees it at least once.

Let $p_i$ be the probability that a visit of user $i$ is recorded in the panel. Assuming independence across visits, the total number of recorded panel events, $K_i \in \{0, 1, 2, \ldots\}$, thus follows a binomial distribution, $K_i \sim \mathrm{Bin}(N_i, p_i)$. To account for heterogeneity across the population we assume $p_i \sim \mathrm{Beta}(\mu, \phi)$, with mean $\mu$ and precision $\phi$ (Ferrari and Cribari-Neto, 2004). Here $\mu$ represents the expected non-missing rate and $\phi$ the (inverse) variation across the population. Integrating out $p_i$ gives a Beta-Binomial (BB) distribution,

$$K_i \mid N_i \sim \mathrm{BB}(N_i; \mu, \phi). \tag{2}$$

Combining (1) and (2) yields a hierarchical beta-binomial negative-binomial hurdle (BBNBH) imputation model with parameter vector $\theta = (\mu, \phi, q_0, r, q_1)$:

$$N_i \sim \mathrm{NBH}(N; q_0, r, q_1) \quad \text{and} \quad K_i \mid N_i \sim \mathrm{BB}(K \mid N_i; \mu, \phi).$$
(3)

2.1 Joint Distribution

The pdf of (2) can be written as

$$g(k \mid n; \mu, \phi) = \binom{n}{k} \frac{\Gamma(k + \phi\mu)\,\Gamma(n - k + (1 - \mu)\phi)}{\Gamma(n + \phi)} \cdot \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma(\phi(1 - \mu))}.$$

For $k = 0$ this reduces to

$$P(K = 0 \mid N = n; \mu, \phi) = \frac{\Gamma(n + (1 - \mu)\phi)}{\Gamma(n + \phi)} \times \frac{\Gamma(\phi)}{\Gamma(\phi(1 - \mu))}. \tag{4}$$

Due to the zero hurdle it is useful to treat $N = 0$ and $N > 0$ separately:

$$P(N, K) = P(K \mid N) \cdot P(N) = \mathrm{BB}(k \mid n; \mu, \phi) \cdot \mathrm{NBH}(n; q_0, q_1, r). \tag{5}$$

For $n = 0$, (5) is non-zero only for $k = 0$, $P(N = 0, K = 0) = q_0$, since $P(K > N) = 0$. For $n > 0$,

$$P(N = n, K = k) = (1 - q_0)\, \frac{1}{B(\phi\mu, \phi(1 - \mu))}\, \frac{(1 - q_1)^{r}}{\Gamma(r)} \times \frac{\Gamma(k + \phi\mu)}{\Gamma(k + 1)} \times \frac{\Gamma(n - k + \phi(1 - \mu))}{\Gamma(n - k + 1)}\, \frac{\Gamma(n + r - 1)}{\Gamma(n + \phi)}\, q_1^{\,n-1} \times \frac{\Gamma(n + 1)}{\Gamma(n)}. \tag{6}$$

2.2 Conditional Predictive Distribution For Imputation

The panel records $k_i$ events for panelist $i$, but we want to know how many events truly occurred. That is, we are interested in (dropping subscript $i$)

$$P(N = n \mid K = k) = \frac{P(K = k \mid N = n)\, P(N = n)}{P(K = k)}. \tag{7}$$

To obtain analytical expressions we consider $k = 0$ and $k > 0$ separately.

k = 0: Either none truly happened ($n = 0$) or a panelist visited at least once ($n > 0$), but none were recorded.

n = 0:

$$P(N = 0 \mid K = 0) = \frac{q_0}{P(K = 0)}. \tag{8}$$

n > 0:

$$P(N = n \mid K = 0) = \frac{1}{P(K = 0)} \times \frac{\Gamma(n + \phi(1 - \mu))}{\Gamma(n + \phi)}\, \frac{\Gamma(\phi)}{\Gamma(\phi(1 - \mu))} \times (1 - q_0)\, \frac{\Gamma(n + r - 1)}{\Gamma(n)}\, \frac{(1 - q_1)^{r}}{\Gamma(r)}\, q_1^{\,n-1},$$

where the second term comes from (4).

k > 0: The zero “hurdle” for $N$ has been surpassed for sure.

n < k: By construction of binomial subsampling,

$$P(N = n \mid K = k) = 0 \quad \text{for all } n < k. \tag{9}$$

n ≥ k: Here

$$P(N = n \mid K = k) = n \cdot q_1^{\,n-1}\, \frac{\Gamma(n - k + (1 - \mu)\phi)}{\Gamma(n - k + 1)\,\Gamma(n + \phi)}\, \Gamma(n + r - 1) \times \left( \sum_{m=0}^{\infty} (m + k)\, \frac{\Gamma(m + \phi(1 - \mu))}{\Gamma(m + 1)}\, \frac{\Gamma(m + k + r - 1)}{\Gamma(m + k + \phi)}\, q_1^{\,m+k-1} \right)^{-1}.$$

TABLE 1: MLE for θ for panel data on YouTube visits in Germany.

       Estimate   Std. Err.   t value    Pr(>|t|)
µ      0.272
q0     0.641      0.016       38.858     0.000
q1     0.982      0.002       494.105    0.000
r      0.252      0.021       11.811     0.000
φ      2.320      0.594       3.907      0.000

3 Parameter Estimation

Let k = {k1, . . .
, kP } be the number of observed events for all P panelists. Each panelist also has socio-economic indicators such as gender, age, and income. These attributes determine their demographic weight $\tilde{w}_i$, which equals the number of people in the entire population that panelist $i$ represents. Finally, let $w_i = \tilde{w}_i \cdot \left( P / \sum_{i=1}^{P} \tilde{w}_i \right)$ be the re-scaled weight of panelist $i$ such that $\sum_{i=1}^{P} w_i$ equals the sample size $P$.

We estimate $\theta$ using maximum likelihood (MLE), $\hat{\theta} = \arg\max_{\theta \in \Theta} \ell(\theta; x)$, where the log-likelihood is

$$\ell(\theta; x) = \sum_{\{k \,\mid\, x_k > 0\}} x_k \cdot \log P(K = k; \theta), \tag{10}$$

and $x = \{x_k \mid k = 0, 1, \ldots, \max(\mathbf{k})\}$, where $x_k = \sum_{\{i \,\mid\, k_i = k\}} w_i$ is the total weight of all panelists with $k$ visits.

For deriving closed form expressions of $P(K = k) = \sum_{n=0}^{\infty} P(N = n, K = k)$ it is simpler to consider $k = 0$ and $k > 0$ separately:

$$P(K = 0) = q_0 + (1 - q_0) \times \frac{\Gamma(\phi)}{\Gamma(\phi(1 - \mu))}\, \frac{(1 - q_1)^{r}}{\Gamma(r)} \times \sum_{n=0}^{\infty} \frac{\Gamma(n + 1 + \phi(1 - \mu))}{\Gamma(n + 1)}\, \frac{\Gamma(n + r)}{\Gamma(n + 1 + \phi)}\, q_1^{\,n}, \tag{11}$$

and for $k > 0$,

$$P(K = k) = (1 - q_0)(1 - q_1)^{r}\, \frac{\Gamma(\phi)}{\Gamma(\mu\phi)\,\Gamma(\phi(1 - \mu))}\, \frac{1}{\Gamma(r)} \times \frac{\Gamma(k + \mu\phi)}{\Gamma(k + 1)} \times \sum_{m=0}^{\infty} (m + k)\, \frac{\Gamma(m + \phi(1 - \mu))}{\Gamma(m + 1)}\, \frac{\Gamma(m + k + r - 1)}{\Gamma(m + k + \phi)}\, q_1^{\,m+k-1}. \tag{12}$$

[Figure 1 panels: (top left) cdf P(N ≤ n; r = 0.25, q1 = 0.98, q0 = 0.64) over true counts N, with q0 = 64%; (top right) pdf Beta(p; µ = 0.27, φ = 2.3) of the non-missingness rate, α = 0.63, β = 1.7; (bottom left) empirical and model cdf P(K ≤ k; θ) over observed counts K, log-likelihood −6466.69; (bottom right) pmf P(N = n | K = k; θ) over true counts N for K = 0 and K = 2, with E(N | K = 0) = 1.02, E(N | K = 2) = 13.12, and an annotation of 79.5%.]

FIGURE 1: Model estimates for: (top left) true counts Ni; (top right) non-missing rate pi; (bottom left) empirical count frequency and model fit; (bottom right) conditional predictive distributions and expectations.

3.1 Fix expected non-missing rate µ

Usually, researchers must estimate all 5 parameters from panel data.
For our application, though, we can estimate (and fix) the non-missing rate $\mu$ a priori, as we have access to internal YouTube log files. Let $\bar{k}_{\tilde{W}} = \sum_{i=1}^{P} \tilde{w}_i k_i$ be the observed panel visits projected to the entire population. Analogously, let $\bar{N}_{\tilde{W}} = \sum_{i=1}^{P} \tilde{w}_i N_i$ be the panel projection of the number of true YouTube visits. While any single $N_i$ is unobservable, we can estimate $\bar{N}_{\tilde{W}}$ by simply counting all YouTube homepage views in Germany from our YouTube log files, yielding $\hat{\bar{N}}_{\tilde{W}}$. We thereby obtain a plug-in estimate of the non-missing rate, $\hat{\mu}_{\mathrm{Logs}} = \bar{k}_{\tilde{W}} / \hat{\bar{N}}_{\tilde{W}}$. The remaining 4 parameters, $\theta_{(-\mu)} = (\phi, q_0, r, q_1)$, can be obtained by MLE, $\hat{\theta}_{(-\mu)} = \arg\max_{\theta_{(-\mu)}} \ell((\hat{\mu}_{\mathrm{Logs}}, \theta_{(-\mu)}); x)$. The overall estimate is $\hat{\theta} = (\hat{\mu}_{\mathrm{Logs}}, \hat{\theta}_{(-\mu)})$.

4 Estimating YouTube Audience in Germany

Here we use data from a German online panel (GfK Consumer Panels, 2013), which monitors web usage of $P = 6{,}545$ individuals in October 2013 (31 days). In particular, we are interested in the probability that an adult in Germany visited the YouTube homepage www.youtube.de. Empirically, $\hat{P}(K = 0) = 0.81$, yielding 19% observed 1+ reach. However, we know by comparison to YouTube log files that the panel only recorded 27.2% of all impressions. We fix the expected non-missing rate at $\hat{\mu} = 0.272$ and obtain the remaining parameters via MLE (Table 1). Figure 1 shows the model fit for the true, observed, and predictive distributions. In particular, the true 1+ reach is 36% ($\hat{q}_0 = 0.64$), not 19% as the naïve estimate suggests.

5 Discussion

We introduce a probabilistic framework to impute missing events in count data, including a hurdle component that adds flexibility for modeling a large share of zeros. Researchers can use our models to obtain accurate probabilistic predictions of the number of true, unobserved events. We apply our methodology to accurately estimate how many people in Germany visit YouTube.
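As a rough numerical check on the reach figures above, the hurdle distribution (1) can be evaluated directly from the rounded estimates in Table 1. The sketch below is an illustration, not the authors' implementation: the function name `nbhPmf` and the truncation point of 5000 are assumptions, and the Gamma-function ratio in (1) is computed with the recurrence c(n + 1) = c(n) · (n + r − 1)/n, c(1) = 1, so that no special-function library is needed.

```javascript
// Shifted negative-binomial hurdle (NBH) pmf from equation (1).
// Returns the array [P(N=0), P(N=1), ..., P(N=nMax)].
function nbhPmf(nMax, q0, q1, r) {
  const pmf = [q0];                              // P(N = 0) = q0
  const scale = (1 - q0) * Math.pow(1 - q1, r);  // (1 - q0) * (1 - q1)^r
  let coeff = 1;  // Gamma(n + r - 1) / (Gamma(r) * Gamma(n)), equal to 1 at n = 1
  let qPow = 1;   // q1^(n - 1), equal to 1 at n = 1
  for (let n = 1; n <= nMax; n++) {
    pmf.push(scale * coeff * qPow);
    coeff *= (n + r - 1) / n;  // advance the Gamma ratio from n to n + 1
    qPow *= q1;                // advance q1^(n - 1) to q1^n
  }
  return pmf;
}

// Rounded estimates taken from Table 1.
const estimates = { q0: 0.641, q1: 0.982, r: 0.252 };

// Truncation point assumed for illustration; the tail beyond it is negligible
// because q1^5000 is vanishingly small for q1 = 0.982.
const pmf = nbhPmf(5000, estimates.q0, estimates.q1, estimates.r);
const totalMass = pmf.reduce((sum, p) => sum + p, 0);

const modelReach = 1 - estimates.q0;  // true 1+ reach implied by the model
const naiveReach = 1 - 0.81;          // observed 1+ reach, from P(K = 0) = 0.81

console.log('total probability mass:', totalMass.toFixed(6));
console.log('model 1+ reach:', modelReach.toFixed(3));
console.log('naive 1+ reach:', naiveReach.toFixed(2));
```

The recurrence avoids evaluating Γ at non-integer arguments; the check that the probabilities sum to approximately one relies on the negative-binomial series identity behind (1).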
Acknowledgments: We want to thank Christoph Best, Penny Chu, Tony Fagan, Yijia Feng, Oli Gaymond, Simon Morris, Raimundo Mirisola, Andras Orban, Simon Rowe, Sheethal Shobowale, Yunting Sun, Wiesner Vos, Xiaojing Wang, and Fan Zhang for constructive discussions and feedback.

References

Fader, P. and Hardie, B. (2000). A note on modelling underreported Poisson counts. Journal of Applied Statistics, 27(8):953–964.
Ferrari, S. and Cribari-Neto, F. (2004). Beta Regression for Modelling Rates and Proportions. Journal of Applied Statistics, 31(7):799–815.
GfK Consumer Panels (2013). Media Efficiency Panel.
Hofler, R. A. and Scrogin, D. (2008). A count data frontier model. Technical report, University of Central Florida.
Hu, M., Pavlicova, M., and Nunes, E. (2011). Zero-inflated and hurdle models of count data with extra zeros: examples from an HIV-risk reduction intervention trial. Am J Drug Alcohol Abuse, 37(5):367–75.
Rose, C., Martin, S., Wannemuehler, K., and Plikaytis, B. (2006). On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data. J Biopharm Stat, 16(4):463–81.
Schmittlein, D. C., Bemmaor, A. C., and Morrison, D. G. (1985). Why Does the NBD Model Work? Robustness in Representing Product Purchases, Brand Purchases and Imperfectly Recorded Purchases. Marketing Science, 4(3):255–266.
Yang, S., Zhao, Y., and Dhar, R. (2010). Modeling the underreporting bias in panel survey data. Marketing Science, 29(3):525–539.

COMMUNICATIONS OF THE ACM | SEPTEMBER 2014 | VOL. 57 | NO. 9 | practice
DOI:10.1145/2643134
Article development led by queue.acm.org

Securing the Tangled Web

Preventing script injection vulnerabilities through software design.

BY CHRISTOPH KERN

Script injection vulnerabilities are a bane of Web application development: deceptively simple in cause and remedy, they are nevertheless surprisingly difficult to prevent in large-scale Web development.

Cross-site scripting (XSS) [2,7,8] arises when insufficient data validation, sanitization, or escaping within a Web application allows an attacker to cause browser-side execution of malicious JavaScript in the application’s context. This injected code can then do whatever the attacker wants, using the privileges of the victim. Exploitation of XSS bugs results in complete (though not necessarily persistent) compromise of the victim’s session with the vulnerable application.

This article provides an overview of how XSS vulnerabilities arise and why it is so difficult to avoid them in real-world Web application software development. Software design patterns developed at Google to address the problem are then described. A key goal of these design patterns is to confine the potential for XSS bugs to a small fraction of an application’s code base, significantly improving one’s ability to reason about the absence of this class of security bugs. In several software projects within Google, this approach has resulted in a substantial reduction in the incidence of XSS vulnerabilities.

Most commonly, XSS vulnerabilities result from insufficiently validating, sanitizing, or escaping strings that are derived from an untrusted source and passed along to a sink that interprets them in a way that may result in script execution. Common sources of untrustworthy data include HTTP request parameters, as well as user-controlled data located in persistent data stores. Strings are often concatenated with or interpolated into larger strings before assignment to a sink. The most frequently encountered sinks relevant to XSS vulnerabilities are those that interpret the assigned value as HTML markup, which includes server-side HTTP responses of MIME-type text/html, and the Element.prototype.innerHTML Document Object Model (DOM) [8] property in browser-side JavaScript code.
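The source-to-sink flow described above can be made concrete with a small sketch. Everything in it is hypothetical and chosen for illustration: the URL, the `title` parameter, and the helper names `readTitleFromUrl`, `renderNotificationUnsafe`, `renderNotificationEscaped`, and `htmlEscape` are not taken from the article. The returned strings stand in for values that would be assigned to an HTML-interpreting sink such as innerHTML.

```javascript
// Minimal HTML escaper for element content and quoted attribute values
// (an illustrative stand-in, not a library function).
function htmlEscape(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Source: an HTTP request parameter, which an attacker can choose freely.
function readTitleFromUrl(pageUrl) {
  return new URL(pageUrl).searchParams.get('title') || '';
}

// Unsafe: the untrusted string is concatenated into markup as-is.
function renderNotificationUnsafe(title) {
  return '<div class="notification">Shared album: ' + title + '</div>';
}

// Safer: the string is HTML-escaped first, so it can only be parsed as text.
function renderNotificationEscaped(title) {
  return '<div class="notification">Shared album: ' + htmlEscape(title) + '</div>';
}

// Hypothetical attacker-chosen value carried in a request parameter.
const payload = '<img src=x onerror=attackScript()>';
const attackerUrl =
  'https://photos.example/invite?title=' + encodeURIComponent(payload);

const title = readTitleFromUrl(attackerUrl);
const unsafeMarkup = renderNotificationUnsafe(title);
const escapedMarkup = renderNotificationEscaped(title);

console.log(unsafeMarkup);   // contains the attacker's element verbatim
console.log(escapedMarkup);  // contains only character references for < and >
```

In the unsafe variant the attacker's element reaches the sink unchanged; in the escaped variant the same input can only be rendered as text.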
Figure 1a shows a slice of vulnerable code from a hypothetical photo-sharing application. Like many modern Web applications, much of its user-interface logic is implemented in browser-side JavaScript code, but the observations made in this article transfer readily to applications whose UI is implemented via traditional server-side HTML rendering.

In code snippet (1) in the figure, the application generates HTML markup for a notification to be shown to a user when another user invites the former to view a photo album. The generated markup is assigned to the innerHTML property of a DOM element (a node in the hierarchical object representation of UI elements in a browser window), resulting in its evaluation and rendering.

The notification contains the album’s title, chosen by the second user. A malicious user can create an album titled:

    <script>attackScript;</script>

Since no escaping or validation is applied, this attacker-chosen HTML is interpolated as-is into the markup generated in code snippet (1). This markup is assigned to the innerHTML sink, and hence evaluated in the context of the victim’s session, executing the attacker-chosen JavaScript code.

To fix this bug, the album’s title must be HTML-escaped before use in markup, ensuring that it is interpreted as plain text, not markup. HTML-escaping replaces HTML metacharacters such as <, >, ", ', and & with corresponding character entity references or numeric character references: &lt;, &gt;, &quot;, &#39;, and &amp;. The result will then be parsed as a substring in a text node or attribute value and will not introduce element or attribute boundaries.

As noted, most data flows with a potential for XSS are into sinks that interpret data as HTML markup. But other types of sinks can result in XSS bugs as well: Figure 1b shows another slice of the previously mentioned photo-sharing application, responsible for navigating the user interface after a login operation. After a fresh login, the app navigates to a preconfigured URL for the application’s main page. If the login resulted from a session time-out, however, the app navigates back to the URL the user had visited before the time-out. Using a common technique for short-term state storage in Web applications, this URL is encoded in a parameter of the current URL.

The page navigation is implemented via assignment to the window.location.href DOM property, which browsers interpret as an instruction to navigate the current window to the provided URL. Unfortunately, navigating a browser to a URL of the form javascript:attackScript causes execution of the URL’s body as JavaScript. In this scenario, the target URL is extracted from a parameter of the current URL, which is generally under attacker control (a malicious page visited by a victim can instruct the browser to navigate to an attacker-chosen URL). Thus, this code is also vulnerable to XSS.

To fix the bug, it is necessary to validate that the URL will not result in script execution when dereferenced, by ensuring that its scheme is benign (for example, https).

Why Is XSS So Difficult to Avoid?

Avoiding the introduction of XSS into nontrivial applications is a difficult problem in practice: XSS remains among the top vulnerabilities in Web applications, according to the Open Web Application Security Project (OWASP) [4]; within Google it is the most common class of Web application vulnerabilities among those reported under Google’s Vulnerability Reward Program (https://goo.gl/82zcPK).

Traditionally, advice (including my own) on how to prevent XSS has largely focused on:

• Training developers how to treat (by sanitization, validation, and/or escaping) untrustworthy values interpolated into HTML markup [2,5].
• Security-reviewing and/or testing code for adherence to such guidance.

In our experience at Google, this approach certainly helps reduce the incidence of XSS, but for even moderately complex Web applications, it does not prevent introduction of XSS to a reasonably high degree of confidence. We see a combination of factors leading to this situation.

Subtle security considerations. As seen, the requirements for secure handling of an untrustworthy value depend on the context in which the value is used. The most commonly encountered context is string interpolation within the content of HTML markup elements; here, simple HTML-escaping suffices to prevent XSS bugs. Several special contexts, however, apply to various DOM elements and within certain kinds of markup, where embedded strings are interpreted as URLs, Cascading Style Sheets (CSS) expressions, or JavaScript code. To avoid XSS bugs, each of these contexts requires specific validation or escaping, or a combination of the two [2,5]. The accompanying sidebar, “A Subtle XSS Bug,” shows this can be quite tricky to get right.

Complex, difficult-to-reason-about data flows. Recall that XSS arises from flows of untrustworthy, unvalidated/escaped data into injection-prone sinks. To assert the absence of XSS bugs in an application, a security reviewer must first find all such data sinks, and then inspect the surrounding code for context-appropriate validation and escaping of data transferred to the sink. When encountering an assignment that lacks validation and escaping, the reviewer must backward-trace this data flow until one of the following situations can be determined:

• The value is entirely under application control and hence cannot result in attacker-controlled injection.
• The value is validated, escaped, or otherwise safely constructed somewhere along the way.
• The value is in fact not correctly validated and escaped, and an XSS vulnerability is likely present.

Sidebar: A Subtle XSS Bug

The following code snippet intends to populate a DOM element with markup for a hyperlink (an HTML anchor element):

    var escapedCat = goog.string.htmlEscape(category);
    var jsEscapedCat = goog.string.escapeString(escapedCat);
    catElem.innerHTML = '<a onclick="createCategoryList(\'' + jsEscapedCat + '\')">' + escapedCat + '</a>';

The anchor element’s click-event handler, which is invoked by the browser when a user clicks on this UI element, is set up to call a JavaScript function with the value of category as an argument. Before interpolation into the HTML markup, the value of category is HTML-escaped using an escaping function from the JavaScript Closure Library. Furthermore, it is JavaScript-string-literal-escaped (replacing ' with \' and so forth) before interpolation into the string literal within the onclick handler’s JavaScript expression. As intended, for a value of Flowers & Plants for variable category, the resulting HTML markup is:

    <a onclick="createCategoryList('Flowers &amp; Plants')">Flowers &amp; Plants</a>

So where’s the bug? Consider a value for category of:

    ');attackScript();//

Passing this value through htmlEscape results in:

    &#39;);attackScript();//

because htmlEscape escapes the single quote into an HTML character reference. After this, JavaScript-string-literal escaping is a no-op, since the single quote at the beginning of the value is already HTML-escaped. As such, the resulting markup becomes:

    <a onclick="createCategoryList('&#39;);attackScript();//')">&#39;);attackScript();//</a>

When evaluating this markup, a browser will first HTML-unescape the value of the onclick attribute before evaluation as a JavaScript expression. Hence, the JavaScript expression that is evaluated results in execution of the attacker’s script:

    createCategoryList('');attackScript();//')

Thus, the underlying bug is quite subtle: the programmer invoked the appropriate escaping functions, but in the wrong order.
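The ordering problem described in the sidebar “A Subtle XSS Bug” can be reproduced without a browser. The sketch below uses simplified stand-ins (`htmlEscape`, `jsStringEscape`, `htmlUnescape`) rather than the Closure Library functions, and `htmlUnescape` only approximates the attribute decoding a browser performs before evaluating an onclick handler. All helper names and their exact escaping sets are assumptions made for illustration.

```javascript
// Simplified HTML escaper (stand-in for an HTML-escaping library function).
function htmlEscape(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;')
    .replace(/'/g, '&#39;');
}

// Simplified JavaScript string-literal escaper: backslash and quotes only.
function jsStringEscape(text) {
  return String(text)
    .replace(/\\/g, '\\\\')
    .replace(/'/g, "\\'")
    .replace(/"/g, '\\"');
}

// Approximates the decoding of an attribute value by the HTML parser.
// &amp; is decoded last so that "&amp;lt;" does not turn into "<".
function htmlUnescape(text) {
  return String(text)
    .replace(/&lt;/g, '<')
    .replace(/&gt;/g, '>')
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, '&');
}

// Wrong order, as in the sidebar: HTML-escape first, then JS-escape.
function onclickWrongOrder(category) {
  const jsEscaped = jsStringEscape(htmlEscape(category));
  return "createCategoryList('" + jsEscaped + "')";
}

// Right order: JS-escape for the string literal, then HTML-escape the whole
// expression for the attribute in which it is embedded.
function onclickRightOrder(category) {
  const expression = "createCategoryList('" + jsStringEscape(category) + "')";
  return htmlEscape(expression);
}

const attack = "');attackScript();//";

// What the JavaScript engine would be handed after attribute decoding.
const evaluatedWrong = htmlUnescape(onclickWrongOrder(attack));
const evaluatedRight = htmlUnescape(onclickRightOrder(attack));

console.log(evaluatedWrong);
console.log(evaluatedRight);
```

With the wrong order, the decoded expression contains an unescaped quote that terminates the string literal early; with the right order, the quote stays backslash-escaped inside the literal. The escapers nest in the reverse of the order in which the browser unwraps the contexts: the innermost context (the JavaScript string literal) is escaped first, and the outermost (the HTML attribute) last.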
Let’s inspect the data flow into the innerHTML sink in code snippet (1) in Figure 1a. For illustration purposes, code snippets and data flows that require investigation are shown in red. Since no escaping is applied to sharedAlbum.title, we trace its origin to the albums entity (4) in persistent storage, via Web front-end code (2). This is, however, not the data’s ultimate origin—the album name was previously entered by a different user (that is, originated in a different time context). Since no escaping was applied to this value anywhere along its flow from an ultimately untrusted source, an XSS vulnerability arises. Similar considerations apply to the data flows in Figure 1b: no validation occurs immediately prior to the assignment to window.location.href in (5), so back-tracing is necessary. In code snippet (6), the code exploration branches: in the true branch, the value originates in a configuration entity in the data store (3) via the Web front end (8); this value can be assumed application-controlled and trustworthy and is safe to use without further validation. It is noteworthy that the persistent storage contains both trustworthy and untrustworthy data in different entities of the same schema—no blanket assumptions can be made about the provenance of stored data. In the else-branch, the URL originates from a parameter of the current URL, obtained from window.location.href, which is an attacker-controlled source (7). Since there is no validation, this code path results in an XSS vulnerability. Many opportunities for mistakes. Figures 1a and 1b show only two small slices of a hypothetical Web application. In reality, a large, nontrivial Web application will have hundreds if not thousands of branching and merging data flows into injection-prone sinks. Each such flow can potentially result in an XSS bug if a developer makes a mistake related to validation or escaping. 
Exploring all these data flows and asserting absence of XSS is a monumental task for a security reviewer, especially considering the ever-changing code base of a project under active development. Automated tools that employ heuristics to statically analyze data flows in a code base can help. In our experience at Google, however, they do not substantially increase confidence in review-based assessments, since they are necessarily incomplete in their reasoning and subject to both false positives and false negatives. Furthermore, they have difficulties similar to those of human reviewers in reasoning about whole-system data flows across multiple system components, using a variety of programming languages, RPC (remote procedure call) mechanisms, and so forth, and involving flows traversing multiple time contexts across data stores.

Similar limitations apply to dynamic testing approaches: it is difficult to ascertain whether test suites provide adequate coverage for whole-system data flows.

Templates to the rescue? In practice, HTML markup, and interpolation points therein, are often specified using HTML templates. Template systems expose domain-specific languages for rendering HTML markup. An HTML markup template induces a function from template variables into strings of HTML markup.

Figure 1c illustrates the use of an HTML markup template (9): this example renders a user profile in the photo-sharing application, including the user’s name, a hyperlink to a personal blog site, as well as free-form text allowing the user to express any special interests.

Some template engines support automatic escaping, where escaping operations are automatically inserted around each interpolation point into the template. Most template engines’ auto-escape facilities are noncontextual and indiscriminately apply HTML escaping operations, but do not account for special HTML contexts such as URLs, CSS, and JavaScript.

Contextually auto-escaping template engines [6] infer the necessary validation and escaping operations required for the context of each template substitution, and therefore account for such special contexts.

Use of contextually auto-escaping template systems dramatically reduces the potential for XSS vulnerabilities: in (9), the substitution of untrustworthy values profile.name and profile.blogUrl into the resulting markup cannot result in XSS, because the template system automatically infers the required HTML-escaping and URL-validation.

XSS bugs can still arise, however, in code that does not make use of templates, as in Figure 1a (1), or that involves non-HTML sinks, as in Figure 1b (5).

Furthermore, developers occasionally need to exempt certain substitutions from automatic escaping: in Figure 1c (9), escaping of profile.aboutHtml is explicitly suppressed because that field is assumed to contain a user-supplied message with simple, safe HTML markup (to support use of fonts, colors, and hyperlinks in the “about myself” user-profile field). Unfortunately, there is an XSS bug: the markup in profile.aboutHtml ultimately originates in a rich-text editor implemented in browser-side code, but there is no server-side enforcement preventing an attacker from injecting malicious markup using a tampered-with client. This bug could arise in practice from a misunderstanding between front-end and back-end developers regarding responsibilities for data validation and sanitization.

Reliably Preventing the Introduction of XSS Bugs

In our experience in Google’s security team, code inspection and testing do not ensure, to a reasonably high degree of confidence, the absence of XSS bugs in large Web applications. Of course, both inspection and testing provide tremendous value and will typically find some bugs in an application (perhaps even most of the bugs), but it is difficult to be sure whether or not they discovered all the bugs (or even almost all of them).

The primary goal of this approach is to limit code that could potentially give rise to XSS vulnerabilities to a very small fraction of an application’s code base.

A key goal of this approach is to drastically reduce the fraction of code that could potentially give rise to XSS bugs. In particular, with this approach, an application is structured such that most of its code cannot be responsible for XSS bugs. The potential for vulnerabilities is therefore confined to infrastructure code such as Web application frameworks and HTML templating engines, as well as small, self-contained application-specific utility modules.

A second, equally important goal is to provide a developer experience that does not add an unacceptable degree of friction as compared with existing developer workflows.

Key components of this approach are:

• Inherently safe APIs. Injection-prone Web-platform and HTML-rendering APIs are encapsulated in wrapper APIs designed to be inherently safe against XSS in the sense that no use of such APIs can result in XSS vulnerabilities.
• Security type contracts. Special types are defined with contracts stipulating that their values are safe to use in specific contexts without further escaping and validation.
• Coding guidelines. Coding guidelines restrict direct use of injection-prone APIs, and ensure security review of certain security-sensitive APIs. Adherence to these guidelines can be enforced through simple static checks.

Inherently safe APIs.
Our goal is to provide inherently safe wrapper APIs for injection-prone browser-side Web platform API sinks, as well as for server- and client-side HTML markup rendering.

For some APIs, this is straightforward. For example, the vulnerable assignment in Figure 1b (5) can be replaced with the use of an inherently safe wrapper API, provided by the JavaScript Closure Library, as shown in Figure 2b (5’). The wrapper API validates at runtime that the supplied URL represents either a scheme-less URL or one with a known benign scheme.

Using the safe wrapper API ensures this code will not result in an XSS vulnerability, regardless of the provenance of the assigned URL. Crucially, none of the code in (5’) nor its fan-in in (6-8) needs to be inspected for XSS bugs. This benefit comes at the very small cost of a runtime validation that is technically unnecessary if (and only if) the first branch is taken: the URL obtained from the configuration store is validated even though it is actually a trustworthy value.

In some special scenarios, the runtime validation imposed by an inherently safe API may be too strict. Such cases are accommodated via variants of inherently safe APIs that accept types with a security contract appropriate for the desired use context. Based on their contract, such values are exempt from runtime validation. This approach is discussed in more detail in the next section.

Strictly contextually auto-escaping template engines. Designing an inherently safe API for HTML rendering is more challenging. The goal is to devise APIs that guarantee that at each substitution point of data into a particular context within trusted HTML markup, data is appropriately validated, sanitized, and/or escaped, unless it can be demonstrated that a specific data item is safe to use in that context based on

Figure 1. XSS vulnerabilities in a hypothetical Web application.
(a) Vulnerable code of a hypothetical photo-sharing application. (b) Another slice of the photo-sharing application. (c) Using an HTML markup template.

[Figure 1 diagrams: each panel shows a Browser, a Web-App Frontend, and Application Backends connected to an application data store (a Profile Store in panel c); the numbered callouts (1) through (13) mark the code snippets and data flows referred to in the text.]

sanitizer to remove any markup that may result in script execution renders it safe to use in HTML context and thus produces a value that satisfies the SafeHtml type contract.

To actually create values of these types, unchecked conversion factory methods are provided that consume an arbitrary string and return an instance of a given wrapper type (for example, SafeHtml or SafeUrl) without applying any runtime sanitization or escaping.

Every use of such unchecked conversions must be carefully security reviewed to ensure that in all possible program states, strings passed to the conversion satisfy the resulting type’s contract, based on context-specific processing or construction. As such, unchecked conversions should be used as rarely as possible, and only in scenarios where their use is readily reasoned about for security-review purposes.

For example, in Figure 2c, the unchecked conversion is encapsulated in a library (12’’) along with the HTML sanitizer implementation on whose correctness its use depends, permitting security review and testing in isolation.

Coding guidelines. For this approach to be effective, it must ensure developers never write application code that directly calls potentially injection-prone sinks, and that they instead use the corresponding safe wrapper API. Furthermore, it must ensure uses of unchecked conversions are designed with reviewability in mind, and are in fact security reviewed.
Both constraints represent coding guidelines with which all of an application’s code base must comply. In our experience, automated enforcement of coding guidelines is necessary even in moderate-size projects—otherwise, violations are bound to creep in over time. At Google we use the open source error-prone static checker1 (https:// goo.gl/SQXCvw), which is integrated into Google’s Java tool chain, and a feature of Google’s open source Closure Compiler (https://goo.gl/UyMVzp) to whitelist uses of specific methods and properties in JavaScript. Errors arising from use of a “banned” API include references to documentation for the corresponding safe API, advising developers on how to address its provenance or prior validation, sanitization, or escaping. These inherently safe APIs are created by strengthening the concept of contextually auto-escaping template engines6 into SCAETEs (strictly contextually auto-escaping template engines). Essentially, a SCAETE places two additional constraints on template code: ˲ Directives that disable or modify the automatically inferred contextual escaping and validation are not permitted. ˲ A template may use only sub-templates that recursively adhere to the same constraint. Security type contracts. In the form just described, SCAETEs do not account for scenarios where template parameters are intended to be used without validation or escaping, such as aboutHtml in Figure 1c—the SCAETE unconditionally validates and escapes all template parameters, and disallows directives to disable the auto-escaping mechanism. Such use cases are accommodated through types whose contracts stipulate their values are safe to use in corresponding HTML contexts, such as “inner HTML,” hyperlink URLs, executable resource URLs, and so forth. 
Type contracts are informal: a value satisfies a given type contract if it is known that it has been validated, sanitized, escaped, or constructed in a way that guarantees its use in the type’s target context will not result in attackercontrolled script execution. Whether or not this is indeed the case is established by expert reasoning about code that creates values of such types, based on expert knowledge of the relevant behaviors of the Web platform.8 As will be seen, such security-sensitive code is encapsulated in a small number of special-purpose libraries; application code uses those libraries but is itself not relied upon to correctly create instances of such types and hence does not need to be security-reviewed. The following are examples of types and type contracts in use: ˲ SafeHtml. A value of type SafeHtml, converted to string, will not result in attacker-controlled script execution when used as HTML markup. ˲ SafeUrl. Values of this type will not result in attacker-controlled script execution when dereferenced as hyperlink URLs. ˲ TrustedResourceUrl. Values of this type are safe to use as the URL of an executable or “control” resource, such as the src attribute of a
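To make the wrapper types, runtime validation, and unchecked conversions described above more concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the Closure Library or Google's implementation: the names `sanitize_url`, `html_escape`, `unchecked_html_from_string`, and `render_profile`, the placeholder URL, and the exact scheme allowlist are all hypothetical.

```python
import html
import re
from dataclasses import dataclass

# Assumed allowlist: scheme-less URLs, or the http(s), mailto, and ftp schemes.
_SAFE_URL = re.compile(
    r"^(?:(?:https?|mailto|ftp):|[^:/?#]*(?:[/?#]|$))", re.IGNORECASE
)
INNOCUOUS_URL = "about:invalid#unsafe"  # hypothetical harmless placeholder


@dataclass(frozen=True)
class SafeUrl:
    """Contract: safe to dereference as a hyperlink URL."""
    value: str


@dataclass(frozen=True)
class SafeHtml:
    """Contract: safe to use as HTML markup."""
    value: str


def sanitize_url(url: str) -> SafeUrl:
    """Runtime validation: keep the URL only if it passes the allowlist."""
    return SafeUrl(url if _SAFE_URL.match(url) else INNOCUOUS_URL)


def html_escape(text: str) -> SafeHtml:
    """Escaping arbitrary text always yields a value satisfying SafeHtml."""
    return SafeHtml(html.escape(text, quote=True))


def unchecked_html_from_string(markup: str, justification: str) -> SafeHtml:
    """Unchecked conversion: applies no sanitization or escaping.

    Every call site must be security reviewed; the justification string
    is a hypothetical device to make such call sites easy to find.
    """
    if not justification.strip():
        raise ValueError("unchecked conversions must state why the contract holds")
    return SafeHtml(markup)


def render_profile(name: str, about_html, homepage: str) -> SafeHtml:
    """Strict rendering: plain strings are escaped, SafeHtml passes through."""
    if isinstance(about_html, SafeHtml):
        about = about_html
    else:
        about = html_escape(str(about_html))
    url = sanitize_url(homepage)
    return SafeHtml(
        '<a href="{}">{}</a><div>{}</div>'.format(
            html.escape(url.value, quote=True),
            html_escape(name).value,
            about.value,
        )
    )
```

In this sketch, application code that calls `render_profile` cannot inject markup by passing a plain string, because only values already wrapped in `SafeHtml` bypass escaping. The one place that needs expert review is any caller of `unchecked_html_from_string`, which mirrors the idea of confining security-sensitive code to a small, separately reviewed library.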

 

QUELEREN RENE COADOU REPUBLIQUE ROBERT JAN ROBERT SCHUMANN DENIS JOSEPH SALVADOR ALLENDE SCHUMANN TALADERCH TEVEN TOUL AN TOUR D AUVERGNE VICTOR HUGO VICTOR GORGEU WALTENHOFENN WALTENHOFFEN WURSELEN YVES THEPOT BACON BAIE TREPASSES BAIE TREPASSES ROZ VEUR BAIE PENFOUL BAILLAOU BALAENNOU BALAENOU BALAN AR GOFF BALANDREO BALANEC BALANEC HUELLA BALANEG BALANEIR BALANOU BALAREN BALIALEC BALY PLUFERN BALY VERN BANALEC AR LOUET BANALOU BAND BANDELLOU BANELL AL LENN BANELL AR GROAZ BANELL AR MERC HED BANELL DALL BANIGUEL BANINE BANTHOU BARADOS BARADOZ BARADOZIC BARADOZIG BARBARY BARBOA BARBU BARENNOU BARGUET BARHIR BARNAO BARNAOU BARNENEZ BARONNOU BAROUZ BARRACHOU BARR AVEL BARRE BARRE NEVEZ BARRIERE CROIX BARRIERE CROIX BARRIERE ROUGE BARVEDEL BASCAM BAS BAS BOURG BASINIC BASSIN 9 MARINE LANINON BASSINIC DANIEL KERLEGAN KERLEGAN KERSEGALOU COAT GRANGE ROCHE D ELLIANT PINS PLEUVEN DUC DUC RELECQ GARIN JAFFRAY KERDALEM KERGOAT DUC NEUF PLEUVEN GERMAIN BOISSIERE BOLAST BOLAZEC VIAN BOLE BOLEDER BOLORE BON COIN BONNE CHANCE BONNE NOUVELLE BONNE RENCONTRE BONNIOUL BON PLAISIR BON REPOS BONTUL BORCH LESTREQUEZ BORDENEN BORHOU BIHAN BORODOU BOSCADEC BOSCAO BOSSAVARN BOSSULAN BOTANIEC BOTAVAL BOT AVAL BOT BALAN BOTBALAN BOT BALAN KER ANNA BOTBEGUEN BOTBERN BOTBIAN BOTBIHAN BOTBODERN BOTCABEUR BOTCADOR BOT CARREC BOT CARREC IZELLA BOTCOAT BOTDOA BOTDREIN BOT DREIN BOTDUEL BOTEDEN BOTFAO BOT FAO BOTFAVEN BOTFORN BOTFRANC BOT GUEZ BOTHALEC BOTHENEZ BOTHUAN BOTIGNERY BOTLAN BOTLANN BOTLAN MATHIEU BOTLAVAN BOTLENAT BOTLOVAN BOTMEUR BOTMEZER BOT ONN BOT PIN BOT QUELEN BOTQUELEN BOTQUENAL BOTQUEST BOTQUIGNAN BOT RADEN BOT REPOS BOTREVY BOTSAND BOTSPERN BOTTREIN BOTVELLEC BOUCHEOZEN BOUDIC BOUDIGOU PEAR PEN AR MENEZ BOUDIGUEN BOUDOU BOUDOUGUEN BOUDOULAN BOUDOULAND BOUDOUREC BOUDRACH BOUDUEL BOUED MANOIR KERSKLOEDENN BOUERES BOUGES BOUGOUGNES BOUGOUROUAN BOUILLARD BOUILLEN BOUILLEN HOZ BOUILLEN AR HOZ BOUILLEN BRAS BOUILLENNOU BOUILLENNOU TREGONDERN BOUILLEN VIAN BOUIS BOULACH BOULAIE 
BOULAOUIC BOULARD GUILLOU GUILLOU AMIRAL KERGUELEN BOUGAINVILLE CAMILLE REAUD CHATEAUBRIAND CLEMENCEAU COMMANDANT MOUCHOTTE CORNEILLE CORNICHE CORNICHE TREZ HIR CORNICHE TREZ HIR BRETAGNE COATAUDON CREAC GWEN KERNEUZEC KERVEGUEN FRANCE LIBRE LAITA PLAGE REPUBLIQUE L EUROPE L OCEAN PLYMOUTH PROVENCE ACACIAS FRANCAIS LIBRES FRERES MAILLET SLIGO MYOSOTIS POILUS D ESTIENNE D ORVES DUPLEIX PONANT EUROPE GAMBETTA GENERAL ISIDORE MARFILLE RICHEPIN KATERINE WYLIE FONTAINE LAITA LEON BLUM LEOPOLD MAISSIN MARECHAL MARINE MAR MENDES FRANCE MER MER TREZ HIR MICHEL BRIANT MOLIERE MONTAIGNE OCEAN PLAGE PLAGES ROBERT SURCOUF SAINTE BARBE MARTIN TANGUY PRIGENT THIERRY D ARGENLIEU VICTOR HUGO VIOLLARD YVES NORMAND YVES NORMANT BOULFRET BOULLAC H BOULLACH BOULLAOUIC BOULLEN BOULOUZARD BOULVA BOULVAS BOULVERN BOULWENN BOUQUIDIC BOURAPA BOURAPPA BOURBONNAISE TY NEVEZ KERVENAL BOURDEL BOURDIDEL BOURETTE BOURG 11 LEUR SANT MERYNN BOURG BOURGADEN BOURG BERRIEN BOURG RUMENGOL BOURG GROUANEC BOURG LAMBER BOURG LAMPAUL BOURG LILIA BOURG LOGONNA BOURG NEUF BOURGNEUF BOURG NEVEZ IZELLA BOURG ROUTE FOUESNANT BOURG ROUTE POULDREUZIC BOURG ROUTE QUIMPER BOURG CADOU BOURG TI BRUG BOURG TREZIEN BOURLACH BOURLAND BOURNAZOU BOUROUGUEL BOUROUILLES BOURRET BOUT BOUTEFELEC BOUTIQUE BOUTOIGNON BOUTREC BOUTROUILLE BOUTROUILLES BOUYOUNOU BOZENOC CRUGUEL BRAMOULLE BRANCOU BRAS BRANCOU BRAZ BRANDERIEN BRANILIEC BREC HLEUZ BREC HOUEL BREGALOR BREGOULOU BREHARADEC BREHARN BREHELIEC BREHELLEN BREHEUNIEN BIAN BREHEUNIEN BRAS BREHICHEN BREHOAT BREHOAT KERDEZPET BREHONNET BREHORAY BREHOSTOU BREHOUNIC BREIGNOU BREIGNOU COZ BRELEIS BRELEVENEZ BREMEL BREMELLEC BREMEUR BREMILLEC BREMILLIEC BREMOGUER BREMPHUEZ BRENANVEC BRENANVEC NEVEZ BRENAVALAN BRENAVELEC BRENDAOUEZ BRENDEGUE BRENDEGUE BIHAN BRENDEGUE BRAS BRENELEC BRENELIO BRENENTEC BRENEOL BRENESQUEN BRENGOEL BRENGOULOU BRENGURUST BRENINGANT HUELLA BRENIZENEC BRENIZENNEC BRENN BRENNANVEC BRENNANVEC NEVEZ BRENNTENOES BRENOT BRENOT VIAN BRENOT VIAN TY BIAN BRENTERC 
H BRENUMERE BRENVARCH BRESLAU BRETIEZ BRETIN BRETOUARE BRETTIN BREUGNOU BREUGUNTUN BREUIL BREUIL BRAS BREUNEN BREUNTEUNVEZ BREVENTEC BREZAL BREZECHEN BREZEHAN BREZEHANT BREZEHEN BREZEUHEN BREZOULOUS BRIAC VIAN BRIAC VRAS BRIANTEL BRIDEN BRIEC BIHAN BRIEC BRAS BRIELLEC BRIEMEN BRIGNEAU KERABAS BRIGNEAU KERNON ARMOR BRIGNEAU KERVETOT BRIGNEAU KERZIOU BRIGNEAU MALACHAPPE BRIGNEAU TEMPLE BRIGNEAU TRELAZEC BRIGNEOCH BRIGNEUN BRIGNIOU BRIGNON BRIGNOU BRIGOULAER BRINGALL BRINGALL HUELLA BRINGALL IZELLA BRIOU BRIOU HUELLA BRISCOUL BRISCOUL HUELLA BROAL BROC HET BROENNIC BROENNOU BROGADEON BROGARONNEC BROGODONOU BROGORONEC BROHEON BROMEUR BROMINOU VRAS BRONDUSVAL BRONE BRONENOU BRONNOLO BRONNUEL BRONOLO BRONUA BROUSTOU BRUC BRUCOU BRUG AN IZEL BRUG BRUGOU BRUGUET BRUKIGER 54 HENT KERMINALOU BRULERIE BRULUEC BRULY BRUMPHUEZ BRUMPHUEZ HUELLA BRUMPHUEZ IZELLA BRUNEC BRUNGUEN BRUNGUEN COZ BRUNGUENNEC BRUNOC BRUYERE BRUYERES BUDOU BUELHARS BUGNET BUHORS BUISSONNETS BULHARS BULLIEN BUNTOU BUORS BUORS VRAS BURDUEL BURLEO BUTOU BUTTE CHEVAL BUZEDO BUZIDAN BUZIT BUZUDEL BUZUDELL VIAN BUZUDIC BUZUEC CABALAN CABARET CABEL CABEL AR RUN CABEURIC CABOUSSEL CADIGOU CADIGUE BIAN CADIGUE BRAS CADOL CADOL INFIRMERIE CADOL KERGOURLAOUEN CADORAN CAELEN CAERO CALAFRES CALE RHUN PREDOU CALETOUR CALIFORN CALLAC CALVIGNE CAMASQUEL CAMBERGOT CAMBLAN CAMBLANC CAMBLAN CREIS CAMBLAN PELLA CAMBLAN STIFFEL CAMEAN CAMEAN BRAS CAMEAN HUELLA CAMEAN IZELLA CAMEAN VIHAN CAMEN CAMEN BIHAN CAMEN BRAS CAMEROS CAMEZEN CAMEZEN VIAN CAMFROUT CAMHARS CAMIOT CAM LOUIS CAMP CAMPAOL CAMPING ATLANTIQUE HENT CAMPING KERANTEREC CAMPING LANHOUARNEC STE ANNE CAMPING BRUYERES KERGOLLOT CAMPING VERT PRATULO CAMPING LAURENT CAMPIR CAMPIR SOUL CAMP KERNEDER CAMPOUL CAMPY CAMUEL CAMUEL LILIA CANAL AN OT CANAPE CANASTEL CANDI BIZOUARN CANHIR CANN CANN BRAZ CANSEAC H CANTEL CAON CAOUET CAOUT CAOUT VIAN CAOUT VRAS CAP COZ 25 CAP COZ 4 HENT TREUZ CAP COZ 6 HENT TREUZ CAP COZ 7 HENT TREUZ CAP CHEVRE CARAES CARANDAES CARBON CARIC 
CARLAY CARMAN CARMAN COZ CARN CARN AN DUC CARN AR HOAT CARN AR MENEZ CARN AR STER CARN AR VERN CARNAVERN CARN BIAN CARN DAOULAS CARN EN DUC CARNEO CARN HIR CARN LAER CARN LOUARN CARNOEN CARN PALUD CARN YANN CAROFF CARONT GLAZ CARONT LUTIN CAROS COMBOUT CARPONT CARPONT BIAN CARREFOUR BRAS CARREFOUR POHER CARREFOURLA CHATAIGNERAIE 60 HENT ROAZHON CARREFOUR TROIS CURES CARRIC CARRIE CARRIERE BLEUE CARRIERE SABLE GORRE BEUZEC CARRIERE KERMAO CARRIERES COSQUER CARRONT GLAS CARROS COMBOUT CARROS PENDU CASAVOYEN CASCADEC CASCADEC NEUZIOU CASERNE CASTEL CASTEL AL LEZ CASTEL AN DOUR CASTEL ANTER CASTEL AN TOUR CASTEL AR BIC CASTEL AR MEUR CASTEL AR RAN CASTEL BOCH CASTEL CORN CASTEL CORN ISTREVET AR BARANEZ CASTEL COUDIEC CASTEL DON CASTEL DOUN CASTEL DOUR LANFEUST CASTEL CASTEL DUFF CASTEL GOURANIC CASTEL GOURANNIC CASTEL GUEN CASTEL HELOU CASTEL HUEL CASTEL KERMOUSTER CASTELLENNEC CASTELLER CASTELLERE CASTELLIC CASTELLIEN CASTELLIOUAS CASTELLOROUP CASTELLOU CASTEL LOUET CASTELLOUROP CASTEL MEAN CASTELMEN CASTELMEUR BIHAN CASTEL NEVEZ CASTEL NEVEZ KEROULIN CASTEL PARIS CASTEL PIK CASTEL PRY CASTEL RUN CASTEL VIAN CASTRELLEN CATELINER CATELOUARN CATHELINER CATHELOUARN CAVARNO CAZIN D AN NEC H CELAN CELERIOU CENTAUREES CENTRALE E D F COMMERCIAL BRETAGNIA COMMERCIAL FONTAINES COMMERCIAL KERMOYSAN EQUESTRE KERVERGAR EQUESTRE MINE RADIO MARITIME CES KERVIHAN CHALET 4 KERDANIOU CHAMP CHAMP COURSE PEN ALAN CHAMP TIR PEN ALLEN CHAMP RIVE CHAMPS AJONCS CHANDOCAR COAT BEUZ HUEL CHANSONNIERE CHANT L ALOUETTE CHAP CHAPEL CHRIST CHAPEL GUILERS CHAPELLE CHAPELLE CHRIST CHAPELLE BEUZEC PORS CHAPELLE MUR CHAPELLE MUR TOULGOAT CHAPELLE GUILERS CHAPELLE JESUS CHAPELLE LAMBADER CHAPELLE MUR CHAPELLENDY CHAPELLE NEUVE CHAPELLE PENHORS CHAPELLE POL CHAPELLE SAINTE GERTRUDE CHAPPELLENDY CHATAIGNERAIE MINE CHAT CHEFFONTAINES CHAT KERMORVAN CHATEAU BOISEON CHEFFONTAINES KERUZORET MUR CHATEAUFUR GALL GAUTIER GUILGUIFFIN KERESTAT KERLAUDY KERUSCAR LESNEVAR LESNEVAR LESNEVAR LESQUIVIT LESVEN 
CHATEAULORE MESQUEON NEUF CHATEAUNEUF PENHOAT PENHOAT PLACAMEN QUELENNEC QUELEREN ROC HOU ROZ TREVIGNON TROHANET CHATEL CHAT KERAZAN CHAT KERLAUDY CHAT KERMINY CHAT LAZ CHAT QUIMERCH CHATRE CHAT TREMAREC CHAUSSEE COSQUER CHAUSSEE FROUT CHAUSSEPIERRE CHEFFREN CHELVEST CHEM CAP CHEM CREACH AN ALEE CHEM DOURIGOU CHEM DOURIGOU 7 ALBATROS AMERS ANSE STYVEL AR GWALARN AR SKLUZ ASTEL BAIE BEG BEG AN DUCHEN BEG AVEL BEG KERVOALIC BEL AIR BELLEVUE BENIGUET BERNARD RIVIERE BONNE NOUVELLE BOSQUETS BRANDEL BREMINOU BRUYERES CABLES CAP CAPUCINES CARN ZU CASTEL AN DOUR NAUTIQUE CHAPELLE LANVOY CHAPELLE ROCH CHATAIGNIERS CHERIGOU CLAIREFONTAINE COAT HIR COAT QUINTOU COLLINES COLOMBIER COLONIE COMMUNAUX COMPERE CORNICHE COSQUER COTIER KERSIDAN COZ FEUNTEUN CREACH VEIL CREISANGUER CREUX CREUX BRUGOU CRINQUELLIC CROAS VER CRUGUEL DAMES BEG AN DUCHEN BEG AR MENEZ BOT CONAN BRUNGUEN CLEMENTEC CLEUSMEUR COAT BEUZ COAT BILY COAT CONQ COAT FEUNTEUN COAT GOAREM COAT LIGAVAN COAT NESCOP COAT OLIER COAT PEHEN COAT TAN COSFORNOU CREAC HARO CREAC IBIL CREACH AN ALE FEUNTEUN NEVEZ KADOR GARENNE GLAZ GARONT AR VESTEL GOAREM AR WERN GROAZ CAER HELLES KERAERON KERAFLOCH KERALEC KERALIES KERAMBARS KERAMBIGORN KERAMBRETON KERAMOIGN KERAMPENNEC KERAMPENNOU KERAMPLEIN KERAMPORIEL KERAMPUS KERANCLOAREC KERANCORDENER KERANGAL KERANGOASQUIN KERANNA KERARGONT KERARNOU KERASQUEL KERASTEL KERASTEL MONTAGNE KERASTROBEL KERATRY KERAUDREN KERAVAL KERBELLAY KERBEN KERBIDEAU KERBIETA KERBIGUET KERBIRIOU KERBORONE KERBOYER KERCARADEC KERC HOAT KERDALAES KERDANIEL KERDARIDEC KER DERO KERDOHAL KERDREVEL KERDRONIOU KERELLEC KEREM KEREQUELLOU KERESSEIS KEREVAL KEREZOUN KERFEST KERGADOU KERGALL KERGARADEC KERGAZEGAN KERGOFF KERGOGNE KERGONAN KERGOULINET KERGRAVIER KERGREIS KERGRENN KERGUEN KERGUESTEN KERHO KERHUEL KERHUELLA KERICUFF KERIDORET KERILIN KERINVEL KERISIT KERIVIN VAO KERLAERON KERLAGATU KERLATEN KERLEAN KERLEDAN KERLEGUER VIAN KERLIC KERLIEN KERLINOU KERLIVIDIC KERLOCH KERLOSQUEN KERLOSSOUARN KERMADEC 
KERMAHONNET KERMARRON HUELLA KERMARRON IZELLA KERMENGUY KERMERRIEN KERMINGHAM KERNADIA KERNEEN KERNEUC KERNIVINAN KERNOACH KERNOTEN KERNOTER KERNOUS KERNOUS KERNUGEN KERNUZ KEROBIN KEROMEN KERORGANT KEROURIAT KEROURIEN KEROUSTILLIC KEROUZEL KERPOL KERREQUEL KERREUN KERRIOU KERROLAND KERROUE KERSALIOU KERSALOMON KERSAUX KERSCAO KERSIGON KERSTRAT KERSUTE KERVAILLANT KERVALGUEN KERVAOTER KERVASTARD KERVEN KERVENNEC KERVENT KERVESCAR KERVEUR KERVEZ KERVICHARD KERVIEL KERVIGNAC KERVIGOT KERVIGOU KERVILLERM KERVIR KERVOURZEC KERVOUYEC KERVOUYEC NEVEZ KERVROACH KERZENNIEL BUTTE CASCADE DIGUE FERME KERGARIOU FONTAINE GARENNE GREVE GROTTE HAIE LIGNE MADELEINE LAMBOUR LAMPHILY LANGOAT LANGOGET LANHOUARNEC LANROZ LANVEUR LANVIC PENTE PRAIRIE TROMENIE VILLENEUVE L EGLANTINE LESCOAT LESNEVEN LESPERBE LEZ STEIR L HOSPICE L HYERES LINEOSTIC LOCHOU LOPEREC LOST AR C HOAT MAEZ REUN MALABRY MEILH BRIEUX MENESSIOU MENEZ GUEN MENEZ KERGUESTEN MENEZ LENDU MENEZ MEUR MENEZ ROUE MENEZ ROUZ MESLEIOU MESTREJOU MESYOUEN MEZ AR C HIBOU MEZENES NIVIRIT PARC AN AOD PARC AN PRAD PARC C HROAZ PARC HARO PARC HASTEL PARC SAUX PARC TINAOUT PARC VEIL PARC VRAS PARK MARC H PARK POULIC PELLAN PEN AN CAP PEN AR CREACH PEN AR VALY PENDREAU PENFEUNTEUN PENHELEN PENHOAT PENHORS PENNAROS PENNARUN LAE ROUDOU PONTUSQUET PORS AR SONER PORS GWIR PORSMILIN PORS MORO POUL AR HORRED POULDUIC POULFOURIC POULGAO POULGUINAN POULHON POUL LAPIC PRAT AR GARGUIC PRATEYER QUESTEL QUILLIHOUARN ROCH KEREZEN ROZ AR BIC ROZARGLIN RUEGLAON RUHORNEC ABEILLES EUGENE GUENOLE HERBOT SEBASTIEN SALLE GALLE ALLEMANDS BOSQUETS CAPUCINS CORDIERS DAMES DOUANIERS ECOLIERS FRERES LUCAS GRIVES JUSTICES LAVANDIERES LYS MIMOSAS OISEAUX POIRIERS POTIERS PRES ROCHES BLANCHES SAULES STANG AVALOU STANG DANIEL STANG KERAMPORIEL STANG KERIOU STANG VEIL PELL STANG VIHANNIC STER AR C HOAT VERGERS TI PIN TOUBALAN TOUL AN AEL TOUL AR VILVIG TOULGOAT TOULL AR ROHOU TOULVEN TRAON STIVEL TREGONNOUR TREGONT MAB TREGOUZEL TREOUZON TREVANNEC TROHEIR 
TROMANE TY MAB FOURMAN TY MAMM DOUE TY NEVEZ KERLAGATU DEUX ROCHES VELEURY DORGEN DOURGUEN DOURIGOU DOURLANOC BORD MER BRUGOU BUIS CAMP KERCARADEC CAMPING CAP D EAU CORNIGOU COSQUER PASSAGE CROISSANT TREFF DUCS DREVEZ GORGEN GOUSFORN SPERNOT GUERDY HALAGE HILDY LECK MANEJ MEJOU LAYOU KERGUESTEN KERRU ROUX MOULIOUEN MOUSTOIR DUNES DUNES TREOMPAN PARADIS PARC ROUZ PASSEUR PENKER COZ ROSPIEC PLATEAU PONTIC QUINQUIS ROHOU RUFA SILENCE SPERNEL STANG STANGALA STER LEI KERGLEUHAN TREFF VAL VIEUX VILLAGE KERLEYOU VRUGUIC VUZUT ECOLIERS EGLANTIERS EGLANTINE ENEZ GLAS ESTACADE FACTEUR FACTEUR KERSAINT FELL FLEURI FLOSQUE FONTAINE FONTAINES KADOR GALETS GARENNES GARVAN GORGEN GOSFORN GOUESTIC GOUGON GOULETQUER GRANGE GREVE KEROHAN GRUGEL GRUGEL KERVEZINGAR HUELLA GUILLEMOTS GUILLY GWESSEILLER GWISSELLIER HOSPICE HUITRES ILE GRISE ILE VERTE ILLIEN SERPIL IROISE KERAMBARBER KERAMBRAZ KERAMBRIEC KERAMPLEN KERANCORDENNER KERANDAILLET KERANGUILLY KERARIS KERBELLEC KERBEOCH KERBERON IZELLA KERBRIGENT KERCOLIN KERDANET KERDANET COSQUER KERDERO KEREQUEL KERFEUNTEUN KERGANOU KERGOZ KERGRIMEN KERGUESTEN KERHOS KERHUEL KERIEQUEL KERIOUANT KERIVOAL KERIVOALEN KERLANO KERLEIGUER KERLIFRIN KERMARIG KERMOJEAN KERNEC KERNEOST KERNILIS KERNIOU KERNISY KERNON KEROHAN KERORGANT KEROUANT KEROUES KEROUZACH KERRICHARD KERRIEN KERSALE KERSEAL KERSYVET KERUSTER KERVARGON KERVELLIC KERVENNOS KERVEUR KERVIHAN KERYANNO KERZERVAN KILOURIN KORRIGANS KRINNOC CARRIERE CROIX LAFFLOSQUE LANNIDY TREVIDY LANNOU LANTIGEN LARENVOIE LAVOIR LAVOIR SAINTE CHRISTINE LAVOIRS LEN GOZ LESQUIDIC IZELLA LESQUIDIC NEVEZ LESQUIDIC TRAON LEURGADORET LEZOUZARD LILAS LINGOZ LINIOUS LISTRY VRAZ LOCAL AMAND LOUIS GUENNEC LOZ MANOIR TREGUER MAT PILOTE MEIL PRY MEJOU KERANDOUIN MELAZE COSQUER MENDIANTS MENEC MENEZ MENEZ BRAS NEVEZ MENEZ BRAZ MENEZ BRUG MENEZ GURET MENEZ KERGUESTEN MENEZ KER GUYOCH MENEZ LAE MENEZ PLENN MENEZ ROUZ MENHIR MESLAN MESSOUDALC H MEZALE MEZALE COSQUER MEZOU VOURCH MINEZ GOUYEN MOGUEROU MOINEAUX MOINES 
BLANCS MOLENE MOUETTES HENANT KERANGOC TRESIGNEAU MUR NOGUELLOU NOISETIERS NOISETIERS KEREZ OUESSANT PACIFIQUE PARC AR CASTEL PARC BRIS PARC C HOESSANT PARC MARCHE PARC MINE PARK DOUAR BERGE PARK MINN PARK ROZ SERUZIER PEINTRES PEMPOUL PEN PEN AN ENEZ PEN AR C HOAT MOUSTERLIN PEN AR C HRA PEN AR HOAT PEN AR PARC PEN AR PAVE PEN AR STER PENFOENNEC PENFRAT PENHOAT SALAUN PENN AR VUR PENQUER PERCHE FONTAINE SUISSE PHARE TRAIN PEUPLIERS PIGEONNIER PINS PLAGE PLAGE GRISE POINTE POINTE PRIMEL POINTE GLUGEAU ALIBEN AR BLEIZ AR CHLAN PONTI PONTUAL VIAN PORS AR VILIN VRAS PORS CAVE PORSMELLEC PORZ OLIER POUL POUL ALEN POULALLANICK POUL AR COQUET POUL AR GOQUET POUL FANG POULGIGOU POULL CALLAC POULLOGODEN PRAIRIE PRAT CUIC PRAT HIR PRES PRESBYTERE PUITS QUISTILLIC REA RELIQUES RELUYEN RESTAURANT KEROUDOT RESTIC VIAN RIVIERE ROC AR C HAD ROCHES ROCH YAN ROSCAROC ROSCOLER ROUDOUIC ROZ AR GRILLET ROZBEG ROZ DANIELOU ROZ GLAZ RUBIERN RUISSEAU SAINTE CHRISTINE RUMORVAN RUN AN ILIZ RUN AR C HAD RUN AR HAD RUN AR LOUARN RUN LAND RUZ CONAN GONVEL ARGENTON GUENOLE JULIEN PHILIBERT SALAMANDRES SAN DIVY BOURG SERINGAS SKEIZ SOURCE SQUIVIT STADE STANG KORRIQUET STANG ROHAN STER CHLAON TAHLI TERRASSE THEVEN CARN TI BRAS TI LIPIG TI NEVEZ KERLAGATHU TORREYEUN TOUL BLEIS TOUL PESKET TOURTERELLES TRAON BIAN TRAON BIHAN LAMBEZELLEC TRAON KER ILLIEN SERPIL TREAS TREBERRE TREOMPAN TRESSENI TREVARGUEN IZELLA TROBORN TROIS CHENES TROLAN TRONEOLY TROVERN TY KELES TY NOD TY RUZ TY STER TY TAD COZ VALANEC VANDREE VARQUES VERDIERS VERN GLAZ VERT VIEUX CAPTAGE VIEUX FOURS VILLEMET VILLENEUVE VINIOU Y RHEUN AR CHEM QUENEACH CHENES CHENES KERANDRO CHERON CHEROU CHEVREL CHEZ MME GILLES C 153 HENT CHRA BOHAST CHRIST CINQ 4 VENTS ABER AJONCS AJONCS D OR ALLAIN ALOUETTES ANCIENS ARGOAT AR VERN VRICK AR VODENIC AR VODENIG AR VODENNIC AR VODENNIG ARZELLIS AUBEPINES AULNE BARADOZIC BEAUSEJOUR BEL AIR BELENOU BELLEVUE BELVEDERE BIS BOULEAUX BOULVAS BRUYERES BUTTE CAMELIAS CAVENTOU CHALONIC D EAU CLAIR LOGIS 
COADIC COMMERCIALE COSQUER VIAN CREAC AL LEO CREACH AR LEO CREACH MICKAEL CRINOU CROAS AR BEUZ CROAS AR GARREC CROAS HIR CROIX MISSIONS CROIX ROUGE D ANTIN BELLEVUE KER ANNA KER ELO KERGUELEN KERMENGUY KERSAUX KROAS SALIOU PAIX RUCHE TERRE NOIRE L ODET PORS MOELAN AJONCS D OR ANCIENS BRUYERES CHARDONS BLEUS FLEURS GENETS HORTENSIAS MARRONNIERS PINS RAMIERS ROSIERS TY BOURHIS DUTERQUE DOURIC CHANOINE CHAPALAIN CROIZIOU GUESCLIN MENEZ BIROU PICHERI 18 ALEZ AR GOSKER PICHERI 19 ALEZ AR GOSKER EGLANTIERS EGLANTIERS NIZON ERNEST RENAN AIRIAU FAO FONTAINE FONTAINES FONTAINE FREDLAND FRELAND FRIEDLAND GARO GARVAN GARZABIC GENETS GENETS D OR GLYCINES GOLLEN GORREKEAR GRATZ GUENAN GUENANS GUENON GUEVEN GUILLY GWEL KAER HENT COZ HIRONDELLES HLM KERHALLON VIHAN HORTENSIAS ILES IRLANDE IZELLA JARDINS ASSOLANT JAURES JEANNE D ARC JONQUILLES JULES DUTERQUE JULIA KER ABARDAEZ KERADENNEC KERALGUY PORSMILIN KER ANNA KERARZANT KERAVEL KERAZAN KERBINIOU KERBLEUNIOU KERBRUG KER COADIC KERDOUSSAL KEREAN KERELLEAU KER ELLEZ KERENTRECH KER EOL KERESPERN KEREVER KERFEUNTEN KERFEUNTEUN KERGALL KERGANAVAL KERGAUTHIER KERGAUTIER KERGOFF KERGUEN KERHEUN KERHORNOU KERIBIN KERIEQUEL KERIFAOUEN KERILIS KERINCUFF KERIVARCH KER IZELLA KERJACOB KERJEAN KERLAOUEN KERLEVEN HUELLA KERLOCH KERLOQUIC KERMARIA KERMILIAU KERMILLAU KERMILLEAU KERMILLIAU KERMOOR KERNAOGUER KERNEVEZ KEROHOU KEROZA KERRENTRECH KERRIOU KERROZA KERSABILIC KERSIOUL KERVAO KERVENNEC KERVEUR KERVILAR KERVIZIGOU KERVOAZEC KERYEQUEL LAENNEC LANNIEN LANN ROHOU LANORGANT LAURIERS LESSARD LILAS LOCH POAS MADELEINE MAHE MATHIEU DONNARD MATHIEU DONNART MEIL COAT BIHAN MENEZ BIHAN MENEZ GOUERON MENEZ HOM MESANGES MESMEUR MIMOSAS MONTGOLFIER MOULIN VENT NEVEZ KERBRAT NOTRE DAME ODET ORATOIRE PARC AN DOSSEN PARC EN DOSSEN PARK AR ROZ PARK BRAS PARK FEUNTEUN PARMENTIER PAU PENAMPRAT PEN AR DORGUEN PEN KERNEVEZ PENMEUR PENN KERNEVEZ PESMARCH PHARES PICHERI 15 ALEZ AR GOSKER BENOIT PINS POL AURELIEN POMMIERS PONANT AR MANACH GUEN ILIS PER PORS 
NEVEZ PRAJOU GUEN PRAJOUS GUEN PRAT HIR PRAT KERGOE PRAT PER PRAT TUDAL PRATUDAL PRIMEVERES QUATRE VENTS QUEFFELEC QUILLIEN QUILLIMADEC RADIOPHARE RAMIERS REUN AR MOAL RHUNEMEZ ROSCO ROSIERS ROUALLOU ROUALOU ROUELLOU ROZ ROZ AVEL ROZENGALL ROZ VOEN RUMERIOU SABLES BLANCS SAINTE BARBE LAURENT MARC MAUDET AURELIEN POL AURELIEN POL D AURELIEN ROCH STANISLAS SAULES SOLEIL LEVANT STADE STANG LOUVARD STANG VIAN STANG VIHAN STANKOU SUFFREN TACHEN FOIRE TACH GLAZ TAL AR VORC H TARROS THEODORE BOTREL TOUL AR HOAT TREGONETER TREGOR TREMENTIN TREMINTIN TROENES TY BODEL TY FEUNTEUN TY GLAS TY GLAZ TY GUEN TY GWEN TY JEROME TY LANN TY VOUGERET USINE HERBOT VELNEVEZ VILIN AVEL VILLENEUVE YANN D ARGENT YEUN PARGAMOU CLAIRE FONTAINE CLARTE CLASTRINEC CLECH BURTUL CLECH MOEN ILE MOLENE ILE QUEFFEN ILES ILE ILE S COMBOUT ILE TRISTAN ILE VIERGE ILIS COZ 19 MARS 1962 ABBADIE ABBE FLEURY ABER BENOIT ACACIAS ADOLPHE BEAUFRERE AJONCS AJONCS D OR AJONCS KEROU ALAIN SAGE ALAVOINE ALBERT CAMUS ALEXANDER GRAHAM BELL BEAU VIGNY NOBEL AL LANN VERTE ALOUETTES ALPHONSE DAUDET ALSACE AMBROISE PARE AMEDEO MODIGLIANI AMIRAL RONARC H ANATOLE BRAS ANATOLE BRAZ AN AVEL C HOUZI AN AVEL VIZ ANCIENNE ROUTE QUILLOUARN AN DIOU GER ANDRE CHENIER ANDRE SUAREZ AN DREZEC ANEMONES ANEMONES MAISON ANJELA DUVAL ANNE BRETAGNE ANSE ANTOINE WATEAU ANTOINE WATTEAU AR C HOAT PIN AR FOENNEC AR FOENNOG AR FORHEN AR GOAREM NEVEZ AR GOAREM VIHAN ARGOAT AR GORRE AR MEAN ARMEN ARMOOR ARMOR ARMORIQUE AR ROZ BRAS AR ROZ BRAZ AR ROZIG AR STERENN AR STIVELL AR VEL AR WAREMM VANAL ATLANTIQUE AUBEPINES AUGUSTE BERGOT AUGUSTE BRIZEUX AUGUSTE RENOIR AUGUSTIN MORVAN AVEL AR MOOR AVEL DRO AVEL VOR AYMER BAIE BALANEC BANINE BAR AL LAN BAR AL LANN BAUDELAIRE BEAUMARCHAIS BEAUSEJOUR BEG AL LANN BEG AN ENEZ BEG AN ISTR BEG AR C HASTEL BEG AR GROAS BEG AR ROZ BEG AVEL BEG LAND BEG LAND RIEC BEG NENEZ BEL AIR BELLEVUE BELLE VUE BELVEDERE BERNARDINE GARREC BERNARD PALISSY BERVILLE BETHEREL BIRDIE BLAISE CENDRARS BLAISE PASCAL BLANQUI 
BLERIOT BODROC H BOGEY BOILEAU BOISSIERE BONEZE BORIS VIAN BOUILLEN BOUTONS D OR BOUVREUILS BRANLY BREIZ IZEL BRENNANVEC BREZEHEN BRIZEUX BROUAN BRUYERES BUTOU BUTTE CALMETTE CALMETTE GUERIN CAMELIAS CAMILLE COROT CAPUCINES CARBONT CARREC ZU CASSARD CASTEL DOUR CAVE CENDRES CERISIERS CHAISES CHALUTIERS CHAMAILLARD CHAMP LIN CHAPELLE CHARCOT COTTET GOFFIC GOUNOD BASTARD GOFFIC PEGUY VOISIN CHARLIE PARKER CHATAIGNERAIE CHATAIGNIERS CHATEAUBRIAND D EAU CHAUMIERE DAMES CHEMINOTS CHENES CHEVREUILS CHRYSALIDES IMPASSE E D F CLAUDE BERNARD CLAUDE GUEN CLAUDE MONET CLEMENT MAROT CLIZIT CLOS CLOS NEVEZ COAT EOZEN COAT PIN COLBERT COLIBRIS COLLINE COMMANDANT CHARCOT COMMANDANT MOGUEROUX COMMANDANT NOEL COMPAS COQUELICOTS CORAN CORBEAU CORDELIERE CORDIER CORMORANS CORNGAD CORNICHE COSQUER COST AR STER COSTE BRIX COTEAU KERANGLIEN COURBET COURLIS COZ CASTEL COZ DOUAR COZ MANER CRAPAUD CREACH CREAC H CREAC AVEL CREAC GUEN CREACH COAT CROAS AR BLEON CROAS AR VILLAR CROAS KERLOCH CROAS VILAR CROISEUR LEYGUES CROISEUR GLOIRE CROISEUR MONTCALM CROIX ROUGE CRUGUEL CYPRES DAHUT DANIEL BERNARD D ARVOR BENIGUET CANAPE COAT HALEG COAT KERHUEL COAT MENGUY COAT MEZ CORNOUAILLE D ECOSSE CROAS AR GAC CROAS AR VOSSEN GALICE GUERNESEY JERSEY KERANDRAON KERAUDY KERBERVET KERBIRIOU KERDANIEL KERDIDREN KEREDERN KERFOS KERFRES KERGONAN KERGONIAM KERGRENN KERGROAS KERGROES KERGUIDAN KERGUS KERHELENE KERHUN KERIEZOU KERIVARCH KERLEGUER KERLEREC KERLIES KERLIGRISTIC KERLOSQUET KERMAHOTOU KERMARRON KERMEUR KERONTEC KEROURVOIS KERSCOFF KERSENE KERSINAL KERSQUINE KERUCHEN KERVEZINGAR HUELLA KERVIGNOUNEN KERVILER KERZOURAT BARRIERE L ABER CARAVELLE CHAPELLE EDF CROIX ROUGE DUNDEE FEE MORGANE FEE VIVIANE FELOUQUE FONDERIE FONTAINE FONTAINE AU LAIT FORET FORGE FREGATE GALICE GALIOTE GOELETTE LANDE MAISON BLANCHE MAISON ROUGE MINOTERIE MISAINE MONTAGNE MOTTE LANDOUARDON NEF L ANSE PRAIRIE ROCHE ROCHE BEAUBOIS RUCHE SOURCE TOUR TROMENIE VIERGE NOIRE VILLENEUVE L AVOCETTE VOIE ROMAINE L L L ELORN LEN GOZ 
LESTONAN VIAN L ILE L ODET DELORISSE LUDUGRIS L USINE MENEZ KEREM MENEZ KERGUESTEN MENEZ KERIVOAS MENEZ KERVEADY MENEZ RHUN DENIS PAPIN PENANGUER PEN AR CREACH PEN AR PEN AR STANG PEN AR VALY PEN AR VIR PICARDIE PONTANE NEVEZ PORS AR PAGN DEPORTES POULGALLEC QUELARNOU RUCROISIC 4 VENTS ABEILLES ACACIAS GUENOLE AJONCS ALIZES ALOUETTES ANEMONES BEGONIAS BRUYERES BUTINEUSES CAMELIAS CAPUCINES CARMES CERISIERS CHARDONNERETS CHATAIGNIERS CHENES COLIBRIS COLVERTS CORMORANS COURLIS CYPRES CYTISES DARDANELLES DENTELLIERES DUNES ECUREUILS FAUVETTES FILETS BLEUS FLANDRES FULMARS FUSILLES GENETS GIRONDINS GLAIEULS GLENAN GLENANS GLYCINES GUILLEMOTS HIBISCUS HIRONDELLES HORTENSIAS HUITRIERS HULOTTES DESIRE LUCAS JACINTHES JUSTICES KORRIGANS LAURIERS LILAS MESANGES MIMOSAS MOUETTES NAVIGATEURS OISEAUX ORMES PETRELS PETUNIAS PEUPLIERS PIGEONS PINS POMMIERS PRES PRIMEVERES PROFESSEURS CURIE SQUIVIDAN RENONCULES RESERVOIRS ROCHES ROMAINS ROSES SANTOLINES SAULES SERINGAS SORBIERS TADORNES TAMARIS TANNERIES TOURTERELLES TRITONS TROENES VANNEAUX DETENTE TOUL AN DREZ TREBEHORET VERDUN DEYROLLE DIBEN VIAN DIDIER DAURAT DIEUDONNE COSTES D IRLANDE DIXMUDE D IZAC CALMETTE PALAUX DOLMEN DOMAINE MICHEL NOBLETZ DOSSEN DRENEC DREZIC 18 JUIN 1940 BEREVEN BERGOT D AMOUR BOULOGNE BRICK BRIGANTIN CALVAIRE CAPITAINE COOK CARBONT CELTIC CERISIER ROSE D EAU CLOS DUVAL CONTE COTRE DUCOUEDIC COUEDIC DUCRETET CRUGUEL MORVAN ROUX VOURCH DRAKKAR DRENEC FLIMIOU FROMVEUR GOLVEZ GRAAL KERVEN DUGUAY TROUIN GUESCLIN HEROS DUKE ELLINGTON LARGE LAVOIR LAYOU LOCH MEJOU LEZANNOU MERLE BLANC MINEZ MITAN MOGUER MONITOR KERLOBRETD DUNES NEVET DUNOIS PAYS GALLES PELICAN PERCEVAL POMMIER BLANC PONTIGOU PORS GWIR PUITS REUNIAT REVEL RIVAGE ROALIS ROHOU ROI ARTHUR ROZ RUMEN SOLEIL LEVANT STAND STANG VERGER VIBEN VIEUX ECOLE FILLES ECOLES ECURIEES ECURIES EDMOND CERIA EDMOND ROSTAND EDOUARD CORBIERE EDOUARD LALO EMBRUNS EMERAUDE EMILE BERNARD EMILE ERNAULT EMILE MASSON EMILE SOUVESTRE EMILE ZOLA EN BAS EN BAS LESTREVET 
EN BAS VILLA KERJANYVIERE EOLE EPERVIERS ERABLES ERIC TABARLY ERMITAGE ERNEST RENAN ESTIENNE D ORVES E TAL ARMOR ETANG EUGENE BELEGUIC EUGENE BERTHOU EUGENE GONIDEC FALC HUN FAUVETTES FERNAND GUEY FEUNTEUN AR ROZEN FEUNTEUN BOL FEUNTEUNIC AR LEZ FEUNTEUNIGOU FEUNTEUN VENELLE FEUNTEUN VERO FIGUIERS FLEURS FLEURUS FONTAINE FOUGERES FOYER 1 ER CUEFF DUAULT LALAISSE VILLON FRANKLIN ROOSEVELT FREDERIC GUYADER FREDERIC MISTRAL FREGATE DECOUVERTE FREGATE LAPLACE FRENES FROMVEUR GABRIEL SIGNE GABRIEL LIPPMANN GAGARINE GALETS GAMBETTA GARENNE GAUGUIN GENETS GENETTES GEORGE SAND BIZET BRASSENS BAIL GERARD NERVAL GIROFLEES GLAIEULS GLENAN GLINEC GLYCINES GOAREM HUEL GOAREM IZELLA GOAREM PERZEL GOARIVEN GOLF GORLANIC GORREQUEAR GOUELET AR LEN GOUEROU GOULET GOULITQUER GOURANOU GRADLON TERRE VENELLE LARGE GREEN GRILLONS GUEL GUELEDIGOU CALVEZ JAN GUIP GUSTAVE FLAUBERT GUSTAVE LOISEAU GUYNEMER GWALARN GWAREMM AR GWELTOC GWEZ KIGNEZ HARAS HAUTS GUERN HAUTS TERENEZ HELENE BOUCHER HENRI BECQUEREL HENRI CARIO HENRI DRONIOU HENRI LAUTREDOU HENT AR VEIL HENT BIAN HENT GLAZ HERVE PORTZMOGUER HIRONDELLES HONORE DAUMIER HORTENSIAS HUEL POULDU ILDUT ILE D YOC K ILE NOIRE ILES IROISE JACINTHES JACQUES BREL JACQUES CARTIER JACQUES DAGUERRE JACQUES GIOCONDI JAOUA JARDINS BARRE BART BOSCO CABIOCH CLOAREC TARTU GIONO JACQUES ROUSSEAU JAURES JULIEN LEMORDANT MERMOZ JEANNE D ARC OBERLE RACINE RICHEPIN JEMMAPES JOACHIM BELLAY JOLAIS JONQUILLES JOS PARKER JULES GOFF JULES MASSENET JULES JULES VERNE KARECK HIR KASTEL KASTELL DOUN KERADEN KERALLE KERAMBAIL KERAMBARBER KERAMBRAS KERAMERRIEN KERANDEN KERANDOUIN KERANDRAON KERANDREON KERITY KERANGUEN KERANGUYON VIHAN KERANGWEN KERANNA KERANROCH KERANROUX KERARMOIGN KERAUDREN KERAVEEC KERAVEL KERAVEN VRAS KERBELEYEN KERBRAT KERBRIANT KERCO KERCONAN BIHAN KERDIES KERDREVOR KEREOL KER EOL KER EOL TREZ HIR KEREON KEREVEN KERFRAM KERFREQUANT KERGARADEC KERGLEDIG KERGLOS KERGOALABRE KERGROES KERGROEZ KERGUELEN KERGUIFFINAN KERGUILIDIC KERHARO KERHEOL 
KERHOANOC KERHOS KERHUEL KERILIO KERILIS KERILLAN KERIZUR KERJEAN KER JOB KERJOURDREN KERLANO KERLIEZEC KERLOC H KERLOCH KERLONGAVEL KERMADEC KERMARRIEN KERMERGANT KERMORVAN KERNEVEN KERNEVEZ KERNIC KERNIOU KERNONEN KERNU KEROUIL KERRADEN KERRIOU KER ROCH KERROUX KERSIGNAT KERSTRAT KERSUGARD KERUZAOUEN KERVARCH KERVEAL KERVELEYEN KERVEN KERVEROT KERVOERET KERVRIOU KERYOCH KER YS KERZOURNIC KORNOG KORRIGANS KREISKER L ABERWRACH BRUYERE LAC LAENNEC FAYETTE LAFAYETTE GARENNE LANDE LAMARTINE LANCELOT LAC LANDE LANGOUSTIERS LANNIC LANNIGOU LANNOU LANVOUEZ LAPEROUSE LAPIC REGENTE TOUR D AUVERGNE LAUNAY LAURIERS LAVANDIERES LAVOIR LAZARE CARNOT IMPASSE LEHOU NOACH LEO LAGRANGE LEON TREBAOL ROUX LERVILY LESQUIDIC LEUR AR HARDIS LEUR KERVELLEC VAISSEAU REGENT LIAIC LICHEN LILAS LINGOZ LITIRI LOCAL AMAND LOCH LOCH AN TARO LOEIS FLOCH LOEIZ AR FLOCH LOSQUEDIC LOUIS ARAGON LOUIS BLERIOT LOUIS BRAILLE LOUIS LAMOUR LOUIS PASTEUR MAHALIA JACKSON MAISON SAGES MANER BIHAN MANOIR KERBADER MARCEAU CURIE JEANNE GLOANEC MARINE MARNE MARQUISE KERGARIOU MARRONNIERS MARYSE BASTIE MATILIN AN DALL MAURICE BARLIER MAURICE BELLONTE MAURICE BROGLIE MAURICE RAVEL MECHOU MEIL HASCOET MEJOU ERC H MEJOU GLUJURAT MEJOU GLUJURET MEJOU KERLANO MEJOU KERONTEC MEJOU MOOR MEJOU SILINOU MENE MENEZ MENEZ AR VEIL MENEZ BERROU MENEZ BRIS MENEZ GROAZ MENEZ KADOR MENEZ KERGUESTEN MENEZ KERNUN MENEZ KEROUIL MENEZ QUENET MENEZ ROZ MENHIR MER MERLIN L ENCHANTEUR MESANGES MESCAM MEZEOZEN MEZMORVAN MEZOU MEZOU VILIN MICHEL GARS MICHIGAN MIMOSAS MINEZ MIQUEL MIRABEAU MISAINIERS MOGUERIOU MOLIERE MONFORT KERDILES MONTE AU CIEL MOTTE MOUETTES MOULIN ARGENT CARAIT CARN CASSE SALLES D OR GOUEZ MOULINS MOUZOU MYOSOTIS MYRDHINE MYRTILLES NAOD AN NEACH NAVAROU NEREIDES NICEPHORE NIEPCE NOMINOE NORD NOROIT NOTRE DAME NYASSA NYASSAS ODEVEN OISEAUX ORMES OUESSANT OYATS PABLO PICASSO PAIX PALMIERS PALUD BIAN PAQUERETTES PARC PARC AN ABAT PARC AR BRIAL PARC AR FORN PARC AR HOTI PARC AR ROUZIC PARCKIGOU PARC MARR PARC MEL PARC 
POUDOU PARC TREIS PARC TYRIEN PARK AR FORUM PARK FORN PARK HUELLA PARK LAND PARK MENHIR ABRAM BERT FEVAL GAUGUIN LANGEVIN LEAUTAUD SERUSIER PECHEURS PELLEOC PEN PEN AN DOUR PENANECH PEN NEN PEN AN NEAC H PEN AN NEN PEN AN PEN AR CREAC H PEN AR CREACH PEN AR GUEAR PEN AR GUER PEN AR HOAT PEN AR MEAN PEN AR PAVE PEN AR RHUN PEN AR STREAT PENERVERN PENFELD CREIS PENFOUL HUELLA PENHARS PENHOAT PENITY PENKEAR PENNALAN PENNANECH PEN PAVE PENQUER PENSEES PERHEREL PERROT PEUPLIERS PHARE PHARE FOUR PHILIPPE JACOB ABELARD BROSSOLETTE BELAY HELORET LOTI PERNES PIERRES NOIRES PINS PIOCA PLAGE PLATANES PLATEAU PLATRESSES POMMIERS PONDAVEN IMPASSE CHRIST YAN PORS AN TREZ PORS AR FORN PORS AR VILLIEC PORS BIHAN PORSGUEN PORS MELEN PORS TREZ POSTE POUL AR HOTI POUL AR MARCHE POUL BIAN POUL BOLIC POUL DOUAR POUL DOUR POULL POULLAOUEC POULL DOUR POULPEYE POULPOCARD POULPRAT POUL RANIKET POULROUC POUL POULYOT POURQUOI PAS PRAIRIES PRAJOU PRATAREUN PRAT AR FEUNTEN PRES PRIMEVERES PROFESSEUR DEBRE PUITS QUATRES VENTS QUATRE VENTS QUINET QUINQUAI RADENNEC RADENOC RADENOC PORSTALL RENARD RENE CHAR RENE GUY CADOU RENE BERRE RENE MORVAN RESISTANCE REUN AR MOAL RHEUN RICHEMONT RIVE RIVIERES RIVOALLAN ROBERT HUMBLOT ROBERT JESTIN ROBESPIERRE ROCH ROC H ROC AR SKOUL ROCHE ROCH GLAZ ROCH HUELLA ROC HIGOU ROCHOU BIHAN ROHENNIC ROI SALAUN ROITELETS ROLAND DORGELES ROL TANGUY ROMAIN ROLLAND ROSCAROC ROSE EFFEUILLEE ROSERAIES ROSES ROSIERS ROUX ROUZ ROZ LUTUN ROZ VOEN RUBAYE RUBEO RUCROIZIC LAND RUELLOU RULENN RUN AR MOAL RUNEVEZ RUSTREYER RUTRAON SABLE BLANC SABLES ALAR ANSELME AUGUSTIN AZENOR SAINTE ANNE SAINTE EDWETT ELOI EXUPERY FIACRE HERBOT IMPASSE MER JOSEPH JULIEN MARC MARTIN MICHEL IMPASSE POL ROUX SEBASTIEN USVEN YVES SALLE FETES SAMSON BIENVENU SAOULEC SAPEUR BEASSE SARTHE SCANTOUREC SEMAPHORE SERGE GAINSBOURG SONNEUR SOURCE SOURCES SPHINX STADE STAGNOL STANCOULINE STANKOU ROUZ STEREC STERNES STER NIBILIC STER VRAZ STIVEL STRASBOURG STREAT VEUR STREAT VOAN STREAT VOAN ARGENTON SURCOUF 
SUROIT SYDNEY BECHET TAL AR MOR TALI TAMARIS TARTANE TAS POIS THEODORE THEODORE BOTREL THEODORE DOARE THEVEN THONIERS TI BANAL TI FORUM TILLEULS TI LOUZOU TOUL AR GALL TOULDON TOULEMONDE TOULL MELEN TOULOUSE LAUTREC TOUR TOUR BLANCHE TOUR D AUVERGNE TOURTERELLES TOUTERELLES TRANSVAAL TRANSVAL TRAON BIHAN TREZ BREMODER TRIELEN TROFEUNTEUN TROIS MATS TROLOGOT TROMEUR TRO NAOD TROUS TROUS AR C HANT TROUZ AR C HANT TULIPES TY BRAZ TY COAT TY DOUR TY GLAS TY GWEN TY LANN TY LOSQUET TY NEZ TYRIEN GLAS TY TRAON TY VARLAES TY VENELLE DERO UTRILLO VALLEE VALLON VALY VAUBAN VEDRINES VELIGUET VENT LARGE VERGER VERGER FERREC VERGERS VERRIER VIBEN VICTOIRE VICTOR HUGO VICTOR SEGALEN VIEUX BOURG VIEUX VIEUX VINCENT VIOLETTES VIVIER VOILIERS VOSGES VOUTE XAVIER GRALL YUNIC YVON SALAUN IMP JACQUES IMP MEZOU VILIN IMP PEN AR STEIR INIBIZIEN INIZIBIAN INTERRIDI INTERRIDY IRVI IRVIT IRZIRY ISAAC ISCOAT ISLE ISLE EN GALL ISLE GOUESNOU ISLE GOURLAY ISLE GRISE ISTREVED WENN ISTREVET AR BARANEZ ISTREVET POULOU IZELLA FALZOU IZELLA KERVASTAL JARD JARDIN BOURG JARDIN BABETTE GORREKEAR JARDIN GLENAN JARDINS BOURG JARDINS PRESBYTERE BOURG JARNELLOU BART JOIE JOLBEC JORNARDY JUBIC JUDICARRE JUGANT JUSPIC JUSTICE JUSTICOU JUSTISOU KADORAN KAERGWEL KALAFARZOU KAMEULEUD KAMPOUALC H KAN AN AOD ROSPICO KAN AN AVEL KAN AR TARZH POULDU KANTREZOG KANTY COZ KAOLIN KERVAO KAOLINS KAOUGANT KARED ATAO KARIT KARN GLAS KARN HENVEZ KARN KERBIRIOU KARN MENEZ BRIS KARN MENEZ GUILLOU KARN MENEZ KERBADER KARN MOEL KARN VEILH KARREC HIR KARRECK HIR KARREG WENN KARRONT AN DRO KASTEL KERAMBLEIS KASTEL KERMAQUER KASTELL AC H KASTELL AR BAIL KASTELL BOCH KASTELL GOURANIG KASTELL MEUR KAVARNO KEF KEF KEFF KEFF KELAREC KELARET KELARNOU KELARRET KELECUN KELEDERN KELER KELERDUT KELERDUT LILIA KELERE KELERET KELERON KELERON VIAN KELERON VRAS KELEROU KELERVEN KELLEREC KELORNET KELOU MAD KENECAOU KENKIS KENKIZ KENNECADEC KER KERABANDU KERABARS KERABEL KERABELLEC KERABERE KERABEREN KERABEUGAN KERABIVEN KERABIVIN KERABJEAN 
KERABO KERABO GROUANEC KERABOMES KERABOUCHENT GARENNE GARENNE PENHOATHON GARENNE FEUNTEUN VENELLE LAGATJAR LAGATVRAN LAGODENNIC GOEMONIERE 111 ST ANTOINE GRANDE BOISSIERE GREVE HALTE HALTE PRAT LAND NEVEZ ILE METAIERIE MOTTE GRANGE GRANGE KERIOUALEN GRENOUILLERE GREVE GREVE BLANCHE LAGUEN GUERANDAISE LAHADIC HAIE HAIE BRUYERE HAIE HAIE KERFLOCH LAHARENA HAUT HAYE LAHINEC LAHINEC HUELLA LAIR LANDEDEO JUSTICE KERIADENNAD KERGUEVELLIC KORRIGANE KERIERE KORRIGANE TI NEVEZ PENHOAT LANDE L LONDINIERE LORETTE LALOURON MADELEINE LAMARCH MARCHE LAMARE LAM AR GROAS LAMARRE MARTYRE LAM AR ZANT LAMARZIN MASCOTTE 4 PEN ALLEN LAMBABU LAMBADER LAMBARGUET LAMBARQUETTE LAMBAS LAMBEGOU LAMBEL LAMBELL 16 AN ALE VRAS LAMBELL LAMBELL AN ALE VRAS LAMBELL CROAS VOALER LAMBELL GARENN AR LOUATEZ LAMBELL KERBISQUET LAMBELL LANVINIGER LAMBELL NATELLIOU LAMBELL TREVIC LAMBER LAMBER PENHARS LAMBERT LAMBERVES LAMBERVEZ LAMBEURNOU LAMBEZEN LAMBIBY LAMBOBAN LAMBOEZER LAMBRAT LAMBRESTEN LAMBRUMEN LAMDREVIRY MECANIQUE METAIRIE METAIRIE KERGUIFFINEC METAIRIE KERMORVAN METAIRIE PENNARUN MINE MINEE LAMMARC H MONTAGNE MONTAGNE ROI MOTHE CHAUME MOTTE MOTTE TRINITE LAMPAUL LAMPAUL COZ LAMPERON LAMPHILY LAMPRAT LAN LANADAN LANALEM LANAMICE LANANDOL LANANNEYEN LAN AN TRAON LAN AR LAN AR C HALVEZ LAN AR C HOAT LAN AR C HOEZEN LAN AR C HOUEZEN LANARCRACH LANARDE LANARFERS LAN AR GALL LAN AR GARCH LAN AR GOFF LAN AR GROAS LAN AR HEUN LAN AR HOAT LAN AR JUSTICE LAN AR MARC H LANARNUS LAN AR POULLOU LAN AR VERN LANATOQUER LANAVAN LANAZOC LANBEUN LANBONOI LAN BRIAC LAN BRICOU LANCAZIN LANCELIN LANCELIN BIHAN LANCELIN IZELLA LANCLEUZEN LANCONAN LANCORFF LAND LANDANET LANDANET CORENTINE LANDANET VIAN LANDAOUDEC LAND AR BARRES LAND AR COAT LAND AR HOAT LAND BEURNOU LAND CAZIN LAND C HOAT LANDCORFF LANDE LANDEBOHER LANDEDEO LANDEDUI LANDEGUEVEL LANDEGUIACH LANDE KERANTORREC LANDE KEROUAC LANDELEAU LANDE LOTHAN LANDE NEVARS LANDENVET LANDENVET BIAN LANDERNE VIAN LANDES LANDE LAURENT LANDES GERMAIN LANDEVADE 
LANDEVENNEC LANDEVET LANDGROES LANDIARGARZ LANDIBILIC LANDIDUI LANDIDUY LANDIGUINOC LANDISQUENA LANDIVIGEN LANDIVIGNEAU LANDIVIGNOU LANDIVINOC BRAS LAND JUSTICE LAND KERANTORREC LAND KERGOULOUET LAND KERUSTUM LAND KERVERN LAND KERVIGNAC LAND LOCH LAND LOTHAN LAND MEUR LAND MINE LANDOGUINOC LANDONOI LANDOUARDON LAN DOUARNABAT LANDOUETE LANDOURIC LANDOURZAN LAND PEN OUEZ LAND PLOUEGAT LANDRE LANDREAN LANDREIGN LANDREIGNE LANDREIN LANDREOUAN LANDRER LANDREVARSEC SALLE LANDREVARZEC SALLE LANDREVELEN LANDREVERY LANDREVEZEN UHELA LANDREZEC LANDREZEOC LANDROGAN LAND TREBELLEC LANDUC LANDUGUENTEL LANDVEN LANDVIAN LANDZENT LANDZIOU LANEON LANERCHEN LANESVAL LANEUNET LANEVRY LANFELLES LANFEUST LANFEZIC LANFIACRE LANFIAN LANFORCHET LANFRANC LANGADOUE LANGAER LANGALED LANGANOU LANGANTEC LANGAS LANGAZEL LANGELIN LANGEOGUER LANGERIGUEN LANGLAZIC BEG AR LANN LANGLAZIK LANGLE LANGOADEC LANGOAT LANGOAT HUELLA LANGOAT IZELLA LANGOAT PENQUER LANGOAZEC LANGOLE LANGOLLET LANGOLVAS VIAN LANGONAVAL LANGONAVEL LANGONERY LANGONGAR LANGONIANT LANGONTENIAD LANGOR LANGOUGOU LANGOUILLY LANGOULIAN LANGOULOUMAN LANGOULOUMANN LANGOUNERY LANGOURON LANGOZ LANGREVAN LANGRISTIN LANGROADES LAN GROAS LANGROAS LANGROAZ LANGROES LANGUENE LANGUENGAR LANGUENO LANGUEO LANGUERC H LANGUERCH LANGUERIEC LANGUERO LANGUIDOU LANGUIEN LANGUIFORCH LANGUILLY LANGUILY LANGUILY BRAS LANGUIOUAS LANGUIS LANGUIVOA LANGUOC LANGURU LANGUSTANS LANGUYAN LANGUZ LANHALLA LANHALLES LANHARO HUELLA LANHARUN LANHERIC LANHERN LANHIR LANHOALLIEN LANHOUARNEC SAINTE ANNE LANHOULOU LANHUEL LANHURON LANIGOU LANINOR LANIO IZELLA LANISCAR LANIVIEC LANIVIT LANJULIEN LANJULITTE LAN KERBREZILLIC LAN KERGOULOUET LAN KERGUIPP LANKERMADEC LAN KERNARET LAN KERVIGNAC LANLEAN LANLELL LANLEYA LANLEYA PENQUER LESCLOEDEN LAN LONJOU LANLOUC H LANLURIEC LANMARC H LANMARCH LANMARZIN LANMEUR LANMEUR LANVOUEZ LAN MOUSTOIR LANN LANNAC H LANNALOUARN LANN AMBROS LANNANEYEN LANNANOU BRAS LANNANOU VIHAN LANNAOUEN HUELLA LANN AR LANN AR BIR LANNARCH LANN AR C 
HOAT LANN AR C HOUEZEN LANN AR GOFF LANN AR HEUNT LANNARIN LANN AR MARROU LANN AR POULLOU LANN AR POULOU LANNARSANT LANNARUNEC LANN BENIGUET LANN BIAN LANN CREAC OALEC LANN DOUARNABAT LANNEBEUR LANNEC LANNEC BRAS LANNEC CREIS LANNECHUEN LANNEC VRAS LANNEG LANNEGENNOU LANNEGUER LANNEGUIC LANNEGUY LANNEINOC LANNELEG LANNELVOEZ LANNELVOUEZ LANNEMER LANNENER LANNENEVER LANNENVAL LANNEON LANNER LANNERCHEN LANNERGAT LANN ER GROEZ LANNERIEN LANNEUNOC LANNEUNVAL LANNEUNVET LANNEUR LANNEUSFELD LANNEUVAL LANNEUVET LANNEVAIN LANNEVEL BIHAN LANNEVEL BRAS LANNEVEL BRAS TRINITE LANNEVEZ LANNEZVAL LANNEZ VIHAN LANN GROAS LANNIC LANNIC ROUZ LANNIC ROUZE LANNIDY LANNIEC LANNIELEC LANNIELLEC LANNIEN LANNIGNEZ LANNIGOS LANNIGOU LANNINOR LANNIOU LANNIRY LANNIVINON LANNIVIT LANN KELLEN SCOLMARCH LANN KERANTOREC LANN KERDILES LANN KERGUEN LANN KERNARET LANN MINEZ LANNOAN LANNOC LANNOC VRAS KERLAN LANNOGAT LANNON LANNOU LANNOUAZOC LANNOU BIAN LANNOU BIHAN LANNOU BRAS LANNOUEDIC LANNOUENNEC LANNOULOUARN LANNOU OUARN LANNOUREC LANNOURIAN LANNOURIEN LANNOURZEL LANN PARCOU LANNUCHEN LANNUET LANNUIGN LANNUNVET LANNUNVEZ LANNURGAT LANNURIEN LANNUZEL LANNUZEL HUELLA LANNUZELLOU LANN VERRET LANN VIHAN LANN VRAS LANN VRAZ LANN WERZIT LANORGANT LANORGARD LANORGUER LANORVEN LANOSTER LANOUAZEC LANOURIS LANOURIST LANOURNEC NOUVELLE MADELEINE LANOVERTE LAN PEN HOAT LANQUISTILLIC LANRIAL LANRIEC LANRIEC 2 HENT PENDUIG LANRIEC 5 HENT PENDUIG LANRIEC 7 HENT PENDUIG LANRIEC PENQUER LANRIEC ROUZ PLEIN LANRIEN LANRIJEN LANRIN LANRINOU LANRIOU LANRIOUL LANRIVAN LANRIVANAN LANRIVINEC LANRIVOAS LANRUC LANSALUD LANSALUDO LANTANGUY LANTEL LANTON LANTRENNOU LANTREOUAR BRAS LANTUREC LANVADEN LANVAIDIC LANVALEN LANVAO LANVAON LANVARO LANVARRO LANVARVIC LANVEGUEN LANVEGUEN MEAN GLAZ LANVELAR LANVELAR BRAS LANVELE LANVEN LANVENEC LANVERC HER LANVEREC LANVERHER LANVERN LANVERN CALAPROVOST LANVERN EGAREC LANVERON LANVERS LANVERZER LANVEUR LANVEUR HUELLA LANVEUR IZELLA LANVEZENNEC LANVEZENNEG LANVIAN LANVIDARCH 
LANVIGUER LAN VIHAN BELLEVUE LANVIHAN KERGASTEL LANVILIO LANVILLOU LANVILY LANVINIGER LANVINTIN LANVISCAR LANVIVAN LANVIZIAS LANVOEZEC LANVON LANVORAN LANVORIEN LANVOUEZ LAN VRAS LANVREIN LANVREL LANVREON LANVRIZAN LANZANNEC LANZAY LANZENT LANZEON LANZEOU LANZIGNAC LANZULIEN P 16 GARENNE LANVERNAZAL PALUD KEREMMA PALUE PALUS KERGARADEC BOISSIERE GARENNE METAIRIE MOTTE PALUD SAUVAGERE PEUPLERAIE PINEDE CROISSANT KERGUELEN PLAINE POINTE POMMERAIE KERBORCH RAISON LARAON L ARCHIPEL HENT AN DACHENN LARDANVA LARIEGAT RIVIERE LARLAN LARLANT LARMOR ROCHE ROCHE CINTREE ROCHE NOIRE MOLE LAROM LAROM VIAN ROTONDE LARRAGEN LARRET LARRIAL LARRIDEG LARRIN LARVEZ LARVOR LARVOR KERNU LARVOR MEJOU ERCH SALLE SALLE PENHOATHON SALLE POULFONNEC SALLE VERTE SAPINIERE SAPINIERE VERN SECHERIE SOURCE STATION TORCHE VALORDY TOUR LATOUR TRINITE TRINITE COADENEZ TRINITE COATUELEN TRINITE ILIOC TRINITE KERARGOURIS TRINITE KERARGUEN TRINITE KERBALANEC TRINITE KERBINIGEN TRINITE KERIARS TRINITE KERIEL TRINITE KERIVIN TRINITE KEROUHAN TRINITE KEROURIN TRINITE KERSALAUN TRINITE LANNEVEL BRAS TRINITE MESBIODOU TRINITE MESQUINIEC TRINITE PEN AR CHOAT TRINITE PENARVERN TRINITE POULPIQUET TRINITE REUN LATTELOU LAUBERLAC H LAUDEMEUR LAUNAY LAUNAY BRIC LAUNAY COFFEC LAUNAY CURUNET LAVADUR LAVALHARS VALLEE LAVALLOT LAVALLOT BIAN LAVALLOT COZOU LAVALLOT CREIZ LAVALLOT IZELLA LAVALOT LAVALOT COZOU LAVANET LAVANET VIAN LAVEN VENELLE LAVENGAT VIEILLE VIEILLE MINE VIGNE VILLENEUVE LAVILLENEUVE LAVRETAL LD BEG MEIL HENT LEACH AN DREAU LEACH AR PRAT LEACH DREAU LEAC HMAT LEACH MODERN LEAC HREN LEARS LEAVEAN BAND BARRIC BAT BELENOU BENDY BERON BEUX BILOU BIRIT BODOU KERLEGAN BON COIN BOT BOUDOU BOULVAS BOURG BOURG CONFORT BOURG LOGONNA BOURG RUMENGOL BOUS BRUGOU BRUGUET BRUNOC BUDOU BUTOU BUZIT BUZUDIC LEC CABELLOU 15 CORNICHE CABELLOU 1 CORNICHE CABELLOU 27 CORNICHE CABELLOU 9 CORNICHE CABELLOU KERMINGHAM CAMP CANDY CANTEL CAON CAOUT CARBON CARN CARPONT CASTEL CHAMP LECH AR LEUQUER TREVIGNON CHATEL CHENAIE 
QUILLOUARN CHEQUER CHEVERNY KERAVAL LECHIAGAT 5 HENT AR FEUNTEUN LECH VIAN CLECH CLEGUER CLEMEUR CLEUSMEUR CLEUYOU CLOITRE CLOS CLOSTROU CLOUET LEC NEVEZ COGNIC COMBOUT COMMUNAL CORREJOU MICHEL CORTIOU CORVEZ COSQUER COSQUER GROUANEC COSQUER COTEAU COTY LE CRAN CRANN CRANO CRAS CREACH CREC CRECQ CREDO CREO CRINOU CROAZIOU CROAZOU CROEZIOU CROEZOU CROISSANT CROIZIOU CRUGUEL CUN CURNIC CURRU CUZ DANEN DELE VRAS DERBEZ DIBEN KERTANGUY DIEVET DIROU DISLOUP DIVID DORGUEN DOSSEN 86 POUNT AR C HANTEL DOSSEN 93 POUNT AR C HANTEL DOULOU DOURIC DOUVEZ KERIEGU DREAU DREFF DREFF EARL MURIER DRENNEC DREUZIC DREVERS DROLOU DRUDEC FAO FAOU FELL FERS FERZOU FOENNEC FORESTIC FORESTOU LE FRANNIC FRANSEC FRESQ FRET FRET CLEGUER FIACRE FRET KERARIOU FRET KERBERLIVIT FRET KERELLOT TREMET FRET KERIFLOCH FRET KERINOU FRET KERIVOALER FRET KERVEDEN FRET LEAC HMAT FRET LESVREZ FRET LOSPILOU FRET PEN AN ERO FRET PEN AR CREACH FRET PEN AR POUL FRET PEN AR POUL TREMET FRET PERROS POULLOUGUEN FRET PERSUEL FRET POTEAU FRET QUEZEDE FRET ROSTELLEC FRET DRIEC FRET FIACRE FRET TALADERCH FRET TREYOUT FRET TREZ ROUZ FRET ZORN FROUT FROUT CREIS GALLEAC H GARNEZ GARO GOADRE GOALES GOAS GOAZIEN GOELPER GOLLEN GORED GORRE GOUARVEN GOUBARS GOUEREC GOUERVEN GOURBI PELLA GOURBI CHENE LAUNAY LETY MOROS GRANNEC GRIBEN GROANEC GROAZOC GROUANEC GUELEROC GUELLENEC GUELLIEC GUELMEUR GUERIC GUERLOCH GUERN VRAS GUERRAND GUETEL GUILLOC GUILLY GUILVIT GUILY GUINEL HAFFOND TY PRAT LE HARTZ HARZ HAUT MENIC LEHEC HEDER HELAS HELEN HELLAS HELLES HENGUER HINGUER HOEL HUELLOU LEIGNAC LEIGN AR LEIGN AR HEFF LEIGN AR MENEZ LEIGN BOZEC LEIGNGOURLAY LEIGN HALLEC LEIGNOC H LEIGNONNEZ LEIGNOU LEIGNROUX LEIGN SAUX LEIGNTHEO LEIGNTUDEC LEIGNVEON LEIGNZARCH LEILZACH LEIMBUREL LEIN LEIN AR FORN LEIN AR RUN LEIN AR VOGUER LEINBIGOT LEINDU LEINDU VIAN LEINEURET LEINEURET HUELLA LEINEUS VRAS LEINEUZ VIHAN LEINEUZ VRAS LEINGDERO LEIN HALEG LEINHANVEC LEINLOUET LEIN LOUET LEINON LEINSCOFF LEINTAN LEIN VIAN LEINZACH LEINZAHO LEISTREINA 
JACQUIDY JAQUIDY JARDIN APPRIVOISE KEROUAL JARDIN PERDU MENEYER KARIB KEF KEFF KEO LABER LABOUS LAND LANN LANNEC LANNIC LANNOU LAUNAY LEC LEIN LEING LEOC LETTY 12 HENT BEG CROASSEN LETTY 15 HENT BEG CROASSEN LETTY 17 HENT BEG CROASSEN LETTY 2 HENT BEG CROASSEN LETTY 6 HENT BEG CROASSEN LETTY TY PALUD LEUHAN LEURE LEZ LIA LICHERN LIORZOU LIVIDIC LOCH LOCH LANDRER LOSCOAT LELOSQUET MANOIR MANOIR HAFFOND MARHALLACH MARROS MECHOU KERZIOU MEJOU MEJOU 4 HENT KERSENTIC MELENNEC MENDY MENEC MENEZ MENGLEUX MENHIR MENHIR GOASVEN MENMEUR MERDY MEZ LEMEZEC HUELLA LEMEZEC IZELLA MEZOU MINE MINEZ MINGANT MOGUER MORDUC MOUILLAGE KERLOC H LE VENT MOUSTER KEREOZEN MOUSTOIR MUNUT MUR LEN LEN AN ITALY NANK HERBOT NAOUNT LEN AR BARRES LEN AR ROZ LEN BAOL LEN C HANKED LENCOAT LENDRER NEIZIC NEO NEZARD NEZERT LENHESQ NIOU NIOU IZELLA NIVER LENN AR BARREZ LENN GOUZ LENNIC LENN KERBERNEZ NODET NOGUEL NONOT L ENSEIGNE LENTEO LENVEN LEN VIAN LEN VIHAN LENZAC H LEOC HEN LEOCHREN LEONGARD PALAIS PALUDEN TREIZ COZ PAOU PARADIS PARADIS KERSIDAN PARCOU PARK YOUENN DREZEN PAVILLON PED PEMPIC PENITY PENKER PENQUER PENQUER DIDY PENTY KERVIGEN PERZEL KERIGOU CEDRE KERMAZEGAN HENANT MANOIR PEULVEN PHARE PHARE TREZIEN PLESSIS PLESSIS KERVAL POIVRE POLHOAT 21 TAL AR HOAD POLHOAT 29 ALEZ GLAZ POLHOAT 35 ALEZ KERBILIEZ POLHOAT 37 ALEZ KERBILIEZ PONTIC PORS PORT PORZ PORZOU POTEAU POTEAU SEVELEDER POTEAU VERT POTEAU VERT KERHUN POULDU 10 DEMEURES HAUT POULDU 20 DEMEURES HAUT POULDU 25 DEMEURES HAUT POULDU POULDU 30 DEMEURES HAUT POULDU CROAS AN TER POULDU HENT AR MOR POULDUIC POULDU KERNEVENAS POULDU KERNICK POULDU KERNOU POULDU KEROU POULDU KERVEO POULDU KERZULE POULDU LOCOUARN POULDU PORSGUERN POULDU PORSMORIC POULDU QUELVEZ POULDU JULIEN POULDU MAUDET POULDU TY MILIN POULLEY PRAJOU PRAT PREDIC PRESBYTERE BOURG PRIOLDY QUAI QUEFF QUELEN QUESTEL QUINQUIS 27 KROAZ HENT QUINQUIS QUINQUIS 38 KROAZ HENT QUINQUIS BEG MEIL 29 CROAS RADEN RANCH RAQUER RECK RELECQ RESTAURANT RESTAURANT COATILEZEC RESTEL REUN 
REUNIC REUT RHEUN RHU RHUN RHU TREVERROC RICK RIVIER LERLAN ROC 2 PORS AN EIS VINIS ROHAN ROHELLOU ROHOU RONCE ROSCOAT ROSIER ROUAL ROUAS ROUDOU ROUDOUS ROUILLEN SQUIVIDAN ROZ ROZIC 251 COAT PEHEN LERRANT LERRET RUAT RUCHER KERREAU RUGUEL RUMEUR RUN RUSQUEC RUVEIC LERVIR RY 3 CANARDS ABELIAS PHILIBERT SACHZ LESAFF AJONCS ALBIN LESALGUEN SALUT SANOU SAOULEC LESAOUVREGUEN SAPIN VERT BARRACHOU LESBERVET BRUYERES BRUYERES KERENOT BUISSONNETS 12 LESCADEC LESCALVAR CAMELIAS KERFANY KERMEN LESCAO CARRIERES LESCARS LESCAST LESCATAOUEN LESCATOUARN CHALETS LESTRAOUEN CHATAIGNIERS 28 HENT BEG CHAUMIERES KERDRUC CHAUMIERES KERVOELLIC CHENES LESCLEDEN LESCLOEDEN SCLUZ LESCOAT LESCOAT BIAN LESCOAT COZ LESCOAT MORIZUR LESCOAT VIAN LESCOBET SCOET LESCOGAN LESCOM LESCOMBLEIS LESCOMBLEIZ LESCONAN LESCONGAR LESCONIL LESCONIL 5 HENT GWALARN LESCONIL PENAREUN LESCONIL TREVELOP LESCONNAIS LESCONVEL LESCORF CORMORANS PALUD TREBANEC LESCORNOU LESCORRE LESCORRE VIAN LESCORS LESCORVAU LESCOUDAN LESCOULOUARN LESCRAN LESCRANN LESCREAC H LESCREVEN LESCUDET LESCUS LESCUZ LESCUZ IZELLA CYPRES MEOT CYPRES POULMENGUY LESDOMINI LESDOURDUFF SEILLOU SEMAPHORE SENTIER SEQUER LESFORN FOUGERES FOUGERES YEUN CONCILY LESFRETIN LESGALL LESGALL AN TARO GENETS PENVERN GENETS TOUL AR ZAOUT GLACIS GLENAN GLENANS GLYCINES KERVELEN LESGOULOUARN LESGUEN HAUTS LANNIDY HAUTS QUELERN HAUTS VEILLENNEC HORTENSIAS HORTENSIAS KERVOUIGEN ILES SILLON LESINQUIT ISLES LESIVY LESIVY BIAN KORRIGANS LESLAC H LESLAE LESLAE GENTILHOMMIERE LESLAN LESLANNOU LESLANOU LESLAOU LESLARCH LESLE LESLEIN LESLEM LESLEM BRAS LESLEM MESCOAT LESLEM PENNAPEUN LESLEM VIAN LESLEM VRAS LESLEVRET LESLIA LESLOC H LESLOHAN LESLOUC H LESLOUCH LESLOYS LESMAEC LESMAHALON LESMAIDIC LESMEILARS LESMEL LESMELCHEN LESMELCHEN BIAN LESMENEZ LESMENGUY LESMEZ MIMOSAS KERGLEUS LESMINGUY LESMINILY LESMOUALC H LESMOUALCH BIAN LESNALEC LESNALEC AR HOAT LESNAOUENEN LESNARVOR LESNEUT LESNEVAR LESNEVAR DOURIC LESNEVEZ LESNOA LESNOAL LESNOAN LESNOA VIAN LESNON LESNON 
IZELLA LESNUT LESOMMY ORMEAUX 8 HENT KERIZAC OROBANCHES KERGUINOU LESOUNOC BIHAN PAQUERETTES ROUAL LESPENGAM LESPENHY LESPENHY VIAN LESPENHY VIHAN LESPERN LESPERNON LESPERNOU PETITES SALLES PEUPLIERS 5 HENT SANT FIAKR LESPEURZ LESPIGUET LESPINOU PINS PINS BEL AIR PINS KERGARIOU LESPLOUENAN LESPODOU LESPOUL LESPRITEN LESPURIT COAT LESPURIT ELLEN QUATRE QUATRE VENTS QUATRE VENTS KEROUEL LESQUELEN LESQUELLEN LESQUER LESQUERN SQUERN LESQUERVENEC LESQUIDIC LESQUIFFIOU LESQUIOU LESQUIVIT LESQUIVIT HUELLA RADENNEC LESREN RHODOS KERVIAN ROBINSONS KERVERET VIAN ROCHERS KERJEAN ROSIERS SALLES LESSALOUS SAPINS FROUTGUEN SAPINS VERTS HERMITAGE LESSEYE LESSIEC LESSINQUET LESSIRGUY LESSOUNOC LESSUNUS STANCOU LESTANET STANG LESTARIDEC LESTENACH STER LESTERIOU LESTEVEN LESTEVENNOC LESTEVEN TREMAZAN KERSAINT THUYAS LESCONAN LESTIDEAU GOZ LESTIDEAU IZELLA STIFF LESTIMBEACH LESTIVIDIC LESTONAN 8 LESTONAN NEVEZ LESTONQUET LESTORHEN LESTOUARN LESTOURDUFF LESTRAON LESTRAOUEN LESTREDIAGAT LESTREGOGNON LESTREGUELLEC LESTREGUELLEC NEVEZ LESTREGUEOC LESTREHONE STREJOU LESTREMEC LESTREMELARD LESTREMEUR LESTRENNEC LESTRENNEC LANLEYA LESTRENNEC LUZIVILLY LESTREONE LESTREONEC LESTREOUZIEN LESTREQUEZ LESTREQUEZ NEVEZ LESTREUX LESTREUX VIHAN STREVET LESTREVET LESTREVIAN LESTREVIGNON LESTREZEC LESTRIGUELLEC LESTRIGUIOU LESTRIVIN LESTRIZIVIT LESTROGAN LESTROIS TROIS CANARDS TROIS PIERRES LESTROUGUY LESTUYEN LESUZAN LESVAGNOL LESVANIEL IZELLA LESVEGUEN LESVEN LESVENAN LESVENANT LESVEN BRAS LESVENEZ LESVENNOC LESVEOC LESVERER LESVERIEN LESVERN LESVERN BRAS LESVERN VIAN LESVERN VRAZ LESVERRER LESVERRIEN LESVERRIN LESVESTRIC LESVEZ LESVEZENEC LESVEZENNEC LESVEZ VIAN VIEUX CHENES LESVIGNAN VILTANSOUS BLANCS SABLONS LESVILY LESVOALCH LESVOALIC LESVOE LESVORN LESVOUALC H LESVOYEN LESVREACH LESVREN LESVREZ TAROS LETHY LETIEZ LETIEZ IZELLA TREAS TREBE TREUSKOAD LETTY LETTY HUELLA LETTY IZELLA LETTY VRAS LETY LEUBIN LEUHAN LEUHANCHOU LEUN LEUQUERDENEZ LEURAMBOYOU LEUR AN TORCH LEURANVOYOU LEUR AR LEUR AR 
BAGAN LEUR AR BOUAR LEUR AR C HALVEZ LEUR AR CLOAREC LEUR AR HARDIS LEUR AR MENEZ LEUR AR MORICE LEURBIRIOU LEURBRAT HENT NOD GWEN LEURCARPIN LEURE LEUREAVEL LEURE BRAS LEURELES LEUREMBOYOU LEURE VIHAN LEURGARRU LEURGARU LEURGUER LEURGUER KERDILES LEURGUER VERVE LEURIAL LEURIOU LEURMELLIC LEURNEVEN LEURNEVEN ALBIN LEUROU BIHAN LEUR PORS LEURRE LEURRE NEVEZ LEURREOU LEURRE VIHAN LEUR SANT MERYNN LEURVEAN LEURVEN LEUR VIHAN LEURVOYEC LEUR VRAS LEUR VRAZ BEUZEC CAP CAVAL LEURZON LEUSTEC LEUZENRENGAN LEUZEUDEULIZ LEUZEULIAT LEUZEUREUGAN LEUZEUREUGANT VALLEE VAQUER VARAC H VARRAC H VARRARC H LEVARZAY LEVENEZ VENIEC VENNEC VENOC VERGER VERGER LANSALUD VERGOZ VERN VERNIC VEROURI VEROURY VEUZ VIEUX BOURG VIEUX VIEUX CHENE VIEUX TRONC VILLAR VILLARD VIQUET VIVIER VIZAC VOT VOURCH VOUSTIC WOUEZ YEUN LEZ LEZABANNEC ZABRENN LEZAFF LEZAGON LEZALAIN ZALUT LEZALVEC LEZANAFAR LEZANQUEL LEZAON LEZARAZIEN LEZARGOL LEZARLAY LEZ AR MENEZ LEZ ARMOR LEZAROUEN LEZ AR STER LEZARZOU LEZAU LEZAVARN LEZAVREC LEZEDEUZY LEZELE LEZENA LEZENOR LEZENVEN LEZEON LEZERDOT LEZEREC LEZERET LEZERGUE LEZERIDER LEZERN LEZ GOAREM LEZHASCOET LEZIDOC LEZIHOUARN LEZINADOU LEZINQUIT LEZIREUR LEZIVIT LEZKIDIG LEZLACH LEZLEIN LEZLIA LEZNEVEZ BIHAN LEZOEN LEZOLVEN ZORN MENEZ SCAO LEZOUAC H LEZOUALCH LEZOUANACH LEZOUDESTIN LEZOUDOARE LEZOUDOARE HUELLA LEZOULIEN LEZOURMEL LEZOURMEL VIAN LEZOUYER LEZOUZARD LEZ PLOUGOULM LEZUGARD LEZUGARD VIAN LEZUREC LEZVEZ LEZ VRAS LEZVREACH L HERMITAGE L HORIZON L HOTEL LI 3 KEROUZINIC LIA LIAISON RADIO LIAVEN HUELLA LIAVEN IZELLA LIBOREC LICHEN LICHOUARN LIDINOC LIENEN LIENEN LILIA AEROPORT ARUNS BAGATELLE BARADOZIC BARADOZOU BARALAN BAS RIVIERE BAZEN HUEN BEAUCHAMP BEAUREGARD BEAUREPOS BECMONT BEG AN HENT BEG AR GROAS BEG AR SPINS BEG AVEL BEG POSTILLON BEL AIR BELLE VUE BELLEVUE BERON BEUZEC BIGODOU BLORIMOND BODERIOU BOSCAO BOSCORNOU BOT BALAN BOTCAEREL BOTCORNOU BOTEGAO BOT FAO BOTHUON BOTLOIS BOTREVY BOTREZ BOTSPERN BOULOUZON BOURG LILIA BOURG NEUF BOURLOGOT BOUTINOU 
BREGOULOU BREVENTEC BREZALOU BREZEHAN BRIGNEAU BRIGNENNEC BRINGALL CALLAC CAMBLAN CAMPAGNE BRIENS CAMPY CANADA CARPONT CASTEL CASTEL AN DAOL CASTELMEIN CHAPELLE CROIX ROZ CHEF CLEUMERRIEN CLEUSDREIN CLEUSTOUL CLEUZEVER CLEUZIOU CLEUZ VRAS CLOS NEUF CLOS NEVEZ CLUJURY COADENEZ COADIGOU COADRY COAT AMOUR COAT AN ESCOP COATANSCOUR COAT AR GUILLY COAT BIAN COAT CANTON COAT CONGAR COAT CONVAL COAT COURANT COAT CULODEN COATELAN COAT FAO COAT GLAS COAT GRALL COAT GUEGUEN COAT HELLES COAT HUEL COATIVELLEC COAT JESTIN COAT KEROEC COAT LESPEL BIAN COAT LESPEL BRAS COAT LEZ COAT LIVINOT COAT LOCMELAR COAT MARCHE D INTERET NATIONAL COAT MENGUY COATMEUR COAT MEZ COAT MORVAN COAT PIN COAT RHEUN COAT SABIEC COAT SAVE COAT SCAER VIAN COAT TY OGANT COATUELEN COBALAN CONFORT KERVOAD CONVENANT CORREJOU COSGLOUET COSMOGUEROU COSPORJOU COSQUER COSQUER BIAN COSQUER BRAS COSQUER GRENN COSQUEROU COSRIBIN COSTOUR COULDRY COZ CASTEL COZ FEUNTEUN COZ LIORZIOU COZ VILIN CREACH AR BLEIS CREACH BALBE CREACH BURGUY CREACH COADIC CREACH CORCUFF CREACH COURANT CREACH GUIAL CREACH ILLER CREACH MADEL CREACH MILOC CREISMEAS CREMENET CRENORIEN CROAS AN DOUR CROAS AN IVILLER CROAS AR BLEON CROAS AR SANT CROAS AVEL CROAS CABELLEC CROAS HIR CROAS KERVERN CROAS LANNEC CROAS NEVEZ CROAS CROASSANT AR VUGALE CROAS VER CROAZ CHUZ CROISSANT BOUILLET CROISSANT KERMINAOUET CROISSANT KERVEC CROISSANT TY NAOUET CROIX LIEUE CROIX MALTOTIERS CROIX TINDUFF CROIX ROUGE CRUGUEL DERBEZ ROCH AOUREN DIFFROUT DINAN DOMAINE DORGUEN DORGUENIC DOUAR NEVEZ DOUR BRAZ DOURDU DOURDUFF EN TERRE DOUR GAON DOURIC AR GUEN DOUR YANN DREVERS DROLOU DROLOU VIAN ENEZ CADEC ENEZ COAT ENEZ SANG FERMOU FEUNTEUN VILER FONTAINE BLANCHE FORESTIC FORESTIC HUELLA FOUR NEUF FROUTVEN GAMER GARE FORET GARENNE AN DALAR GASPOTEN GAVRE GLASSUS GOANDOUR GOAREM AR ZANT GOAREM COAT GOAREM CREIS GOAREMIC GOAREM MINE HOM GOAREM VORS GOASALEGUEN GOAS AR C HOR GOAS AR HAOR GOAS AR RESTAURANT GOASBIZIEN GOASMOAL GOELAN GORRE BEUZIT GORRE NAOD GORREQUER 
GOUESNACH NEVEZ GOULHEO GOURIN GRAGINE GRANDES SALLES KERVAO LARGE KERILLEZ ROUTE GRANDS CHATAIGNIERS GREVE KERDREIN GRIGNALLOU GUELEVARCH GUENVEZ GUERLESTAN GUERLOCH GUERN AR MEAL GUERNEVEZ GUERNIGOU GUERRUAS GUERRUS GUERVEN GUERVEUR GUICHEGU GWAREMM TACHENN LAE HAUT LAUNAY HELLEZ HENT AR FOENNEC HENT AR FORN GOZH HENT AVEL DRO HENT COAT HUELLA HENT COAT MENHIR HENT KAN AN AVEL HENT KERIKEL HENT KERLER HENT KERSENTIC HENT MENEZ KERRIOU ILE BERTHOU ILE KERAFRANC ILLIEN AN TRAON INISTIEN JUDEE KARN MOEL KARN STER KASTELL AC H KELERDUT KERABAS KERABEGAT KERABELLEC KERABORN KERACHEN KERADIGUEN KERADORET KERADRAON KERADRIEN KERADROCH KERAFLOCH KERALAN KERALCUN KERALIAS KERALIO KERALIOU KERALUIC KERAMANACH KERAMBARS KERAMBELLEC KERAMBORGNE KERAMBOURG KERAMBROCH KERAMEN IZELLA KERAMER KERAMOAL KERAMPELLAN KERAMPRONOST KERAMPROVOST KERANA KERANCHOAZEN KERANDIDIC KERANDIVEZ KERANDRAON KERANDREAU KERANDREGE KERANDRENNEC KERANEOST KERANEU KERANFORS KERANGALL KERANGOFF KERANGRENEN KERANGUEN KERANGUEVEN KERANHEROFF KERANHOAT HUELLA KERANILIS KER ANNA KERANNA KERANNAOU KERANNOUAT KERANROUX KERANROY KERANSIGNOUR KERANTALGORN KERANTER KERANTRAON KERAORET KERAOUL KERARGAM KERARGUEN KERARGUEN AN DOUR KERARMEL KERARMERRIEN KER ARMOR KERARPANT KER ARZEL KERASCOET KERASTANG KERASTROBEL KERAUDRY KERAVEL KERAVELOC KERAVEZAN BIAN KERAVEZAN BRAS KERAVEZEN KERAVIL KERAVILIN KERAZORET