ICSI Research Review
Thursday, October 8, 2015
3:00 - 5:20 p.m., ICSI Lecture Hall
Featured talks by ICSI research staff highlighting some of our latest results and new directions in computer science research. Talks will be given in the sixth floor lecture hall.
2:30 Refreshments
3:00 "Header Enrichment or ISP Enrichment: Emerging Privacy Threats in Mobile Networks"
Narseo Vallina Rodriguez
Networking and Security
3:20 "Measurement and Analysis of Traffic Exchange Services"
Mobin Javed
Networking and Security
3:40 "Information Flow Experiments on Ad Privacy Settings"
Michael C. Tschantz
Networking and Security
4:00 "Making Privacy Decisions in Ubiquitous Computing Environments"
Serge Egelman
Networking and Security
4:20 "Limits and Leverage Points"
Barath Raghavan
Networking and Security
4:40 "Combining Randomized Linear Algebra and Stochastic Gradient Descent for Large-Scale Machine Learning"
Jiyan Yang
Research Initiatives
5:00 "MMCommons Dataset and the Social Event Discovery System"
Jaeyoung Choi
Audio and Multimedia
Abstracts
Header Enrichment or ISP Enrichment: Emerging Privacy Threats in Mobile Networks
-
Narseo Vallina Rodriguez of Networking and Security
HTTP header enrichment allows mobile operators to annotate HTTP connections with a wide range of request headers. Operators employ proxies to introduce such headers for operational purposes, and---as has recently been widely publicized---also to help advertising programs identify the subscriber responsible for the originating traffic, with significant consequences for the user's privacy. In this talk, we describe our efforts to identify and characterize HTTP header enrichment in modern mobile networks. Our study uses data collected by the Netalyzr network troubleshooting service over 16 months. We present a timeline of HTTP header usage for 299 mobile service providers from 112 countries, observing three main categories: (1) unique user and device identifiers (e.g., IMEI and IMSI), (2) headers related to advertising programs, and (3) headers associated with network operations.
This work is in collaboration with Vern Paxson, Srikanth Sundaresan, and Christian Kreibich.
The paper on which this talk is based won the best paper award at the ACM SIGCOMM Workshop on Hot Topics in Middleboxes and Network Function Virtualization (HotMiddlebox 2015) in August.
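As a rough illustration of the idea behind detecting header enrichment, the sketch below compares the headers a client sent with the headers a cooperating server received and reports any that appeared in transit. It is a minimal sketch under stated assumptions, not a description of how Netalyzr is implemented; the function name and the header `X-Example-Subscriber-Id` are hypothetical.

```python
def injected_headers(sent, received):
    """Return headers seen by the server that the client never sent.

    Header names are compared case-insensitively, since HTTP header
    names are case-insensitive.
    """
    sent_names = {name.lower() for name in sent}
    return {
        name.lower(): value
        for name, value in received.items()
        if name.lower() not in sent_names
    }


# Hypothetical example: an in-path proxy adds a subscriber identifier.
sent = {"Host": "example.org", "User-Agent": "test-client"}
received = {
    "host": "example.org",
    "user-agent": "test-client",
    "X-Example-Subscriber-Id": "abc123",  # hypothetical injected header
}
print(injected_headers(sent, received))
```

A real measurement would also need to account for headers that proxies modify or remove, which this sketch ignores.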
Measurement and Analysis of Traffic Exchange Services
-
Mobin Javed of Networking and Security
Traffic exchange services enable members to bring traffic to their websites from a diverse pool of IP addresses, in return for visiting sites of other members. We examine the world of traffic exchanges to characterize their makeup, usage, and monetization. We find that the ecosystem includes a range of services, from manual exchanges where participants must solve CAPTCHAs between successive page views, to exchanges that provide tools that automatically surf without requiring any user action. By “milking” a sample of these exchanges, we analyze month-long datasets to examine the nature of URLs that members submit to them. We find a wide prevalence of URLs for services that pay users in return for views to their content, and at least 30% of the requested impressions are for pages that clearly participate in a class of impression fraud called referrer spoofing. We also analyze the size and composition of a sample of these exchange networks by making purchases, finding that the exchanges delivered visits from roughly 200K unique IP addresses, and that in some exchange networks, the majority of visits came from cloud hosting services.
This is joint work with Cormac Herley and Marcus Peinado from Microsoft Research, and Vern Paxson, director of the Networking and Security Group at ICSI.
Information Flow Experiments on Ad Privacy Settings
-
Michael C. Tschantz of Networking and Security
To partly address people's concerns over web tracking, Google has created the Ad Settings webpage to provide information about and some choice over the profiles Google creates on users. We present AdFisher, an automated tool that explores how user behaviors, Google's ads, and Ad Settings interact. AdFisher can run browser-based experiments and analyze data using machine learning and significance tests. Our tool uses a rigorous experimental design and statistical analysis to ensure the statistical soundness of our results. Using AdFisher, we find that Ad Settings was opaque about some features of a user's profile, that it does provide some choice on ads, and that these choices can lead to seemingly discriminatory ads. In particular, we found that visiting webpages associated with substance abuse changed the ads shown but not the settings page. We also found that setting the gender to female resulted in fewer instances of an ad related to high-paying jobs than setting it to male. Our limited visibility into the ad ecosystem prevents us from assigning blame, but these results can form the basis for investigations by the companies themselves or by regulatory bodies.
This is joint work with Amit Datta and Anupam Datta.
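To illustrate the kind of significance test such an experiment can rely on, here is a minimal sketch of an exact permutation test on two groups of browser instances. It is not AdFisher's code: the test statistic (absolute difference in mean ad counts) is a simple stand-in chosen for this example, and the function name and data are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means.

    Enumerates every way of relabeling the pooled observations into groups
    of the original sizes, so it is only practical for small samples.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)

    def abs_mean_diff(sum_a):
        return abs(sum_a / n_a - (total - sum_a) / n_b)

    observed = abs_mean_diff(sum(group_a))
    count = extreme = 0
    for idx in combinations(range(len(pooled)), n_a):
        count += 1
        # Small tolerance guards against floating-point ties.
        if abs_mean_diff(sum(pooled[i] for i in idx)) >= observed - 1e-12:
            extreme += 1
    return extreme / count


# Hypothetical counts of one ad seen by each browser instance in two groups.
p = permutation_p_value([10, 11, 12, 13], [1, 2, 3, 4])
print(p)
```

Because the p-value is computed from relabelings of the observed data, the test makes no assumption about how the ad counts are distributed.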
Making Privacy Decisions in Ubiquitous Computing Environments
-
Serge Egelman of Networking and Security
The advent of the smartphone has ushered in an era of unprecedented access to rich user data. This has allowed third-party applications to innovate by supporting new interaction modalities, better integrating with users' lifestyles, and making relevant information more accessible. At the same time, the abundance of personal data presents very real privacy risks. In this talk, I discuss previous and ongoing research to help users make more informed choices about how their personal data is accessed. I present previous work on smartphone platforms that has provided insights into users' behaviors and preferences, as well as how to design systems that empower users to make better privacy decisions. Because wearable and continuous sensing devices are becoming more prevalent, I show how we are applying this work to ubiquitous computing environments.
Limits and Leverage Points
-
Barath Raghavan of Networking and Security
The context and target environments of networked systems research invisibly yet profoundly affect the problems that we select to work on. In this talk I will discuss a new context, motivated by ecological and sociopolitical limits, that I believe is important for systems research to consider for our work to have social impact. I'll then discuss two new projects — on the design of rural networks and on systems for computational agroecology — that I have launched within this context.
Combining Randomized Linear Algebra and Stochastic Gradient Descent for Large-Scale Machine Learning
-
Jiyan Yang of Research Initiatives
In recent years, stochastic gradient descent (SGD) methods and randomized linear algebra (RLA) algorithms, along with their variants, have been widely applied to large-scale problems in machine learning and data analysis. We aim to bridge the gap between these two methods in solving constrained overdetermined linear regression problems, e.g., L2 and L1 regression problems. We propose a hybrid algorithm named pwSGD that uses RLA techniques for preconditioning and constructing an importance sampling distribution, and then performs an SGD-like iterative process with weighted sampling on the preconditioned system. We prove that pwSGD inherits faster convergence rates that depend only on the lower dimension of the linear system, while maintaining low computational complexity. In particular, when solving L1 regression with size n by d, pwSGD returns an approximate solution with ϵ relative error in the objective value in O(log n ⋅ nnz(A) + poly(d)/ϵ) time. This complexity is uniformly better than that of RLA methods in terms of both ϵ and d when the problem is unconstrained. For L2 regression, pwSGD returns an approximate solution with ϵ relative error in the objective value and the solution vector measured in prediction norm in O(log n ⋅ nnz(A) + poly(d) log(1/ϵ)/ϵ) time. Finally, the effectiveness of such algorithms is illustrated numerically on both synthetic and real datasets; the results are consistent with our theoretical findings and demonstrate that pwSGD converges to a medium-precision solution, e.g., ϵ = 0.001, more quickly.
This paper will appear in SODA 2016. It is joint work with Yinlam Chow, Chris Re, and Michael W. Mahoney.
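The three steps described in the abstract (precondition, build an importance sampling distribution, run weighted SGD on the preconditioned system) can be sketched for unconstrained L2 regression as follows. This is a simplified illustration, not the authors' implementation: an exact QR factorization stands in for the randomized preconditioner, the sampling weights are exact leverage scores, and all function names are hypothetical. With the step size 1/d used here, each update is a randomized Kaczmarz projection on the preconditioned system.

```python
import random


def gram_schmidt_qr(A):
    """Thin QR of a tall full-rank matrix (list of rows), modified Gram-Schmidt."""
    n, d = len(A), len(A[0])
    cols = [[A[i][j] for i in range(n)] for j in range(d)]
    Q_cols, R = [], [[0.0] * d for _ in range(d)]
    for j in range(d):
        v = cols[j][:]
        for k in range(j):
            R[k][j] = sum(q * x for q, x in zip(Q_cols[k], v))
            v = [x - R[k][j] * q for x, q in zip(v, Q_cols[k])]
        R[j][j] = sum(x * x for x in v) ** 0.5
        Q_cols.append([x / R[j][j] for x in v])
    Q = [[Q_cols[j][i] for j in range(d)] for i in range(n)]
    return Q, R


def back_substitute(R, y):
    """Solve the upper-triangular system R x = y."""
    d = len(y)
    x = [0.0] * d
    for i in reversed(range(d)):
        s = sum(R[i][j] * x[j] for j in range(i + 1, d))
        x[i] = (y[i] - s) / R[i][i]
    return x


def preconditioned_weighted_sgd(A, b, iters=500, seed=0):
    """Approximately minimize ||A x - b||_2 via weighted SGD on Q y = b, y = R x."""
    rng = random.Random(seed)
    Q, R = gram_schmidt_qr(A)            # preconditioning step (exact QR here)
    d = len(R)
    leverage = [sum(q * q for q in row) for row in Q]  # sums to d
    probs = [score / d for score in leverage]          # importance distribution
    y = [0.0] * d
    step = 1.0 / d
    for _ in range(iters):
        i = rng.choices(range(len(Q)), weights=probs)[0]
        residual = sum(q * v for q, v in zip(Q[i], y)) - b[i]
        scale = step * residual / probs[i]   # reweight to keep the gradient unbiased
        y = [v - scale * q for v, q in zip(y, Q[i])]
    return back_substitute(R, y)         # map back to the original variables


# Small consistent system whose exact solution is x = [2, -1].
A = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
b = [2.0, 1.0, 0.0, -1.0]
print(preconditioned_weighted_sgd(A, b))
```

The example uses a consistent system so that the iterates converge to the exact solution; on noisy data a constant step size would only reach a neighborhood of the minimizer.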
MMCommons Dataset and the Social Event Discovery System
-
Jaeyoung Choi of Audio and Multimedia
The publication of the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M)---to date the largest open-access collection of photos and videos---has provided a unique opportunity to stimulate new research in multimedia analysis and retrieval. To make the YFCC100M even more valuable, we have started working towards supplementing it with a comprehensive set of precomputed features and high-quality ground truth annotations. As part of our efforts, we released the Multimedia Commons Dataset.
Event 360 is an online interactive social event browser that allows the user to explore events detected within the Multimedia Commons Dataset. The system addresses five key aspects of social multimedia event detection and summarization: multimodality, scale, diversity of representations, noise of multimedia items, and missing metadata. The detection algorithm uses an unsupervised clustering approach that exploits temporal, spatial, and textual metadata. For each detected event cluster, the system uses hierarchical clustering that exploits both visual and audio information to choose the best subset of photos that meets both relevance and diversity criteria. The system scales well and is effective in producing high-quality summaries of the detected events.
This work was selected as a finalist for the Yahoo! Grand Challenge at the ACM Conference on Multimedia.