Graduate Seminar with Trong Nguyen: Stratified Random Sampling from Streaming Data

When

November 20, 2019    
1:10 pm - 2:00 pm

Where

3043 ECpE Building Addition
Coover Hall, Ames, Iowa, 50011

Event Type

Speaker: Trong Nguyen, ECpE Graduate Student

Advisor: Srikanta Tirthapura

Title: Stratified Random Sampling from Streaming Data

Abstract: Stratified random sampling is a widely used sampling technique for approximate query processing. We consider stratified random sampling on continuously arriving data streams, and make the following contributions. We present a lower bound that shows that any streaming algorithm for stratified random sampling must have, in the worst case, a variance that is Ω(r ) factor away from the optimal, where r is the number of strata. We present S-VOILA, a streaming algorithm for stratified random sampling that is locally variance-optimal. Results from experiments on real and synthetic data show that S-VOILA results in a variance that is typically close to an optimal offline algorithm, which was given the entire input beforehand. We also present a variance-optimal offline algorithm VOILA for stratified random sampling. VOILA is a strict generalization of the well-known Neyman allocation, which is optimal only under the assumption that each stratum is abundant, i.e. has a large number of data points to choose from. Experiments show that VOILA can have significantly smaller variance (1.4x to 50x) than Neyman allocation on real-world data.

Loading...