How Non-PII Data Can Deanonymize Users

30 May 2024

Authors:

(1) Ángel Merino, Department of Telematic Engineering Universidad Carlos III de Madrid {angel.merino@uc3m.es};

(2) José González-Cabañas, UC3M-Santander Big Data Institute {jose.gonzalez.cabanas@uc3m.es};

(3) Ángel Cuevas, Department of Telematic Engineering Universidad Carlos III de Madrid & UC3M-Santander Big Data Institute {acrumin@it.uc3m.es};

(4) Rubén Cuevas, Department of Telematic Engineering Universidad Carlos III de Madrid & UC3M-Santander Big Data Institute {rcuevas@it.uc3m.es}.

Abstract and Introduction

LinkedIn Advertising Platform Background

Dataset

Methodology

User’s Uniqueness on LinkedIn

Nanotargeting proof of concept

Discussion

Related work

Ethics and legal considerations

Conclusions, Acknowledgments, and References

Appendix

In this section, we give an overview of the literature related to this paper. We divide our literature review into works addressing users’ uniqueness and works focusing on nanotargeting.

8.1 Uniqueness

Previous studies have addressed user uniqueness in the context of social media and advertising [8] [34] [35], recommendation systems [36], mobile networks [2], credit card purchases [3], and even hidden services [37]. Most of the research in this area focuses on demonstrating how users can be deanonymized using information that, in principle, may not be considered PII.

A seminal work in this area demonstrated that combining gender, zip code, and birth date was enough to deanonymize 87% of the 248M citizens in the 1990 US census [5]. This research was replicated for the 2000 US census, which included 281M citizens; using the same combination of items, the percentage of identified users declined to 63% [6]. This means that roughly 2 out of 3 users could be identified in a user base of 281M users by combining only three (non-PII) data points: gender, zip code, and birth date. In this line, a recent work proposed a generative model to re-identify users based on demographic attributes and estimated that 99.98% of Americans would be correctly re-identified in any dataset using 15 demographic attributes [7]. In line with the previous result, two works report that the location and time associated with four mobile phone calls [2] or four credit card purchases [3] allow identifying a user in a database of 1M users with a probability ≥ 90%. Again, the conclusion is that 4 (non-PII) data points uniquely identify a user among 1M users.
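To make the notion of uniqueness used by these studies concrete, the following minimal sketch computes the fraction of records that are unique under a given combination of attributes. The function name `unique_fraction` and the toy records are illustrative assumptions; they are not data or code from any of the cited works.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Return the fraction of records whose values for `keys` occur exactly once."""
    if not records:
        return 0.0
    # Count how many records share each combination of attribute values.
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Toy, made-up records (hypothetical; not taken from any census dataset).
toy_records = [
    {"gender": "F", "zip": "28001", "birth": "1980-01-02"},
    {"gender": "F", "zip": "28001", "birth": "1981-03-04"},
    {"gender": "M", "zip": "28001", "birth": "1975-06-30"},
    {"gender": "F", "zip": "28911", "birth": "1990-11-15"},
]

for keys in (["gender"], ["gender", "zip"], ["gender", "zip", "birth"]):
    print(keys, unique_fraction(toy_records, keys))
```

On this toy data the unique fraction grows from 0.25 to 0.5 to 1.0 as attributes are added, which mirrors, at a very small scale, how a few non-PII data points can single out most users.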

In another well-known work [4], the authors managed to deanonymize some users in the supposedly anonymized Netflix Prize dataset (480k users) [38] using film rating entries from the IMDb database. The authors showed that 8 movie ratings, along with the approximate dates when they were made, lead to a user identification probability of 99%.

We can also find a recent work that aims to identify users in hidden services, where anonymity is quite important [37]. The authors propose using publicly available Bitcoin payment data in those hidden services to unveil users’ identities. They conducted a real experiment in which they linked 125 unique users to 20 hidden services such as The Pirate Bay [39] and Silk Road [40].

Recent work also implements attribute inference attacks using publicly available non-PII data on players of the popular game Dota 2 [41]. In that setting, an attacker can unveil private information about a user by exploiting publicly available information in the user profile. Although this work is not intended to make a user unique, it relates to our research since it exploits publicly available non-PII attributes to disclose a potential privacy issue.

Closer to our study, some works look into user uniqueness on social media platforms. In [34], the authors use the fan page likes of Facebook users to deanonymize them. In [35], users’ web browsing history is used to reveal the identity of those users on Twitter. Finally, in [8], the authors demonstrate that 4 rare interests or 22 random interests that Facebook has assigned to a user for advertising purposes make a user unique with a probability of 90% among almost 3 billion users.

Our paper extends this literature by showing how a novel piece of non-PII data, such as professional skills, can efficiently re-identify a user. Our results demonstrate that 7 data items (location and 6 skills) make a user unique on LinkedIn with a probability of 75% among a user base of ∼800M users. The case of LinkedIn differs from most studies in the literature in two respects: (i) Skills and location are information that users report in their LinkedIn profile, and this information is publicly available. In contrast, many of the previous works rely on information that is not easy to obtain, such as the browsing history of the user, records of phone calls or credit card purchases, the list of ad preferences assigned by Facebook to a user, etc. (ii) The skills and location information is actionable to reach users with tailored ads. All the works in the literature but one, i.e., the one using FB ad preferences, do not address how the information used to identify a user can be activated to reach them.

8.2 Nanotargeting

The concept of nanotargeting is not new. Most of us are susceptible to receiving nanotargeted advertising through email, SMS, or postal mail. Running nanotargeting campaigns based on PII such as email address, mobile phone number, or postal address is trivial and has been done for many years.

Also, some social media platforms, like Facebook, Twitter, LinkedIn, etc., offer advertisers the possibility of creating so-called Custom Audiences or Matched Audiences to launch ad campaigns. A standard process to create a custom audience starts with the advertiser providing the social media platform with a file that includes PII, such as email addresses or phone numbers, of the users the advertiser aims to target. The social media platform matches the provided PII with its internal records to identify the users registered on the platform. The resulting list is the custom audience. From that moment, the advertiser can use that custom audience to deliver ads to the users included in it.
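The matching step described above can be sketched as follows. This is a simplified illustration, not any platform's actual API: the function names, the choice of SHA-256 over a trimmed and lowercased email, and the toy user table are all assumptions made for the example.

```python
import hashlib


def normalize_and_hash(email):
    """Trim, lowercase, and hash an email (normalization scheme is an assumption)."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def build_custom_audience(uploaded_emails, platform_users):
    """Return the sorted IDs of platform users whose email appears in the upload.

    `platform_users` is a hypothetical mapping of user ID to registered email.
    """
    uploaded_hashes = {normalize_and_hash(e) for e in uploaded_emails}
    return sorted(
        user_id
        for user_id, email in platform_users.items()
        if normalize_and_hash(email) in uploaded_hashes
    )


# Hypothetical platform records and advertiser upload.
platform_users = {
    "u1": "alice@example.com",
    "u2": "bob@example.com",
    "u3": "carol@example.com",
}
upload = [" Alice@Example.com ", "dave@example.com", "carol@example.com"]

print(build_custom_audience(upload, platform_users))
```

Only users registered on the platform end up in the audience; the unmatched address in the upload is silently dropped, which is why the resulting list can be smaller than the uploaded file.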

We can find real examples that have exploited the custom audience feature to run nanotargeting campaigns. In [31], one of the co-founders of Hawkers describes how they exploited nanotargeting based on custom audiences to target celebrities. He acknowledges that they used Twitter’s Custom Audiences [42] to show an ad exclusively to Cristiano Ronaldo. Even though a custom audience must contain at least 1000 users, they used publicly available data to include Cristiano Ronaldo in a list where he was the only man. They then filtered that audience so that the campaign was delivered only to the men in it. In other words, they managed to nanotarget Cristiano Ronaldo. Their final goal was to make him aware of the brand and to have him approach them to arrange potential collaborations later, which is a clear example of user manipulation or influence using nanotargeting.
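The trick described in this example can be illustrated with a toy model: the audience satisfies the minimum size, but an additional demographic filter narrows delivery to a single person. The constant, the function name `deliverable_users`, and the synthetic audience are assumptions for illustration only and do not reflect Twitter's actual interface.

```python
MIN_CUSTOM_AUDIENCE_SIZE = 1000  # minimum audience size mentioned in the text


def deliverable_users(custom_audience, gender):
    """Apply a gender filter on top of a custom audience that meets the size minimum."""
    if len(custom_audience) < MIN_CUSTOM_AUDIENCE_SIZE:
        raise ValueError("custom audience is below the platform minimum")
    return [user["id"] for user in custom_audience if user["gender"] == gender]


# Synthetic audience: 999 padding users plus one target who is the only man.
padding = [{"id": f"user{i}", "gender": "F"} for i in range(999)]
audience = padding + [{"id": "target", "gender": "M"}]

print(deliverable_users(audience, "M"))
```

The size check passes because it is applied to the list as uploaded, while the filter is applied afterwards, so the campaign effectively reaches one user.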

We can find other similar examples that have exploited custom audiences in various manners to nanotarget individuals [43, 44, 45, 46, 47, 48]. However, it is worth noting that using custom audiences to run nanotargeting campaigns fundamentally differs from our work since custom audiences are based on PII.

We can find a few examples in the literature that use non-PII attributes to implement nanotargeting campaigns. Dave Kerpen [49] conducted an experiment, launching a campaign directed explicitly at his wife with the following parameters: 31-year-old married female employee of Likeable Media living in New York City. The advert’s target audience was one user among the hundreds of millions of users on Facebook at that time. The advert reached his wife, and only her; she then conducted the same experiment, reaching Dave, and only him, with a response ad. In an academic work from 2010 [50], the authors incidentally rely on nanotargeting to infer additional (unknown) personal data from the targeted user. This work describes a technique based on fine-grained narrow audiences and campaign performance reports. The authors leave the unknown item they want to reveal as a free parameter (e.g., age). They then use multiple attributes retrieved from the Facebook profile of the user and run multiple nanotargeting campaigns, each including a different value for the item they want to reveal. If all the campaigns report that they have reached 0 users except one, which reports that it reached one user, the value of the unknown item is the one used in that campaign.
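The inference technique from [50] can be sketched with a simulated platform. Here `campaign_reach` stands in for a campaign performance report, and both function names and the toy user base are hypothetical; a real attack would read reach figures from the advertising platform rather than from a local list.

```python
def campaign_reach(users, targeting):
    """Simulated performance report: number of users matching every targeting attribute."""
    return sum(
        1 for user in users
        if all(user.get(key) == value for key, value in targeting.items())
    )


def infer_attribute(users, known, unknown_key, candidates):
    """Run one campaign per candidate value and apply the decision rule from the text.

    Returns the candidate value only if exactly one campaign reached exactly one
    user and every other campaign reached zero users; otherwise returns None.
    """
    reach_by_value = {
        value: campaign_reach(users, {**known, unknown_key: value})
        for value in candidates
    }
    nonzero = {value: reach for value, reach in reach_by_value.items() if reach > 0}
    if len(nonzero) == 1:
        value, reach = next(iter(nonzero.items()))
        if reach == 1:
            return value
    return None


# Hypothetical user base; the attacker knows everything about the target except age.
toy_users = [
    {"city": "Madrid", "employer": "Acme", "gender": "F", "age": 31},
    {"city": "Madrid", "employer": "Acme", "gender": "M", "age": 45},
    {"city": "Paris", "employer": "Acme", "gender": "F", "age": 31},
]
known = {"city": "Madrid", "employer": "Acme", "gender": "F"}

print(infer_attribute(toy_users, known, "age", range(18, 66)))
```

If the known attributes are not narrow enough to isolate one user (for instance, dropping `city` above), more than one user matches and the sketch returns `None`, reflecting that the technique depends on the audience already being a single person.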

Our work provides a much more comprehensive vision regarding the nanotargeting issue than the two cited works. First, we provide a theoretical analysis that explains the problem’s dimension and the success probability of nanotargeting campaigns. Second, we run a proof of concept experiment that proves nanotargeting can be systematically implemented within the LinkedIn advertising platform.

Only one previous work in the literature approaches nanotargeting in the same way we do, using non-PII information systematically [8]. This work analyzes how many ad preferences (i.e., interests FB assigns to users based on their activity to deliver them relevant ads) make a user unique on Facebook and whether that result can be used to activate a nanotargeting campaign. The conclusion is that 4 rare ad preferences or 22 random ad preferences are enough to make a user unique on Facebook with a probability of 90%. In addition, this work demonstrates that running nanotargeting campaigns on Facebook is feasible based on users’ ad preferences.

The most important difference with respect to our work is that ad preferences are not public information, and getting access to them is not trivial. The authors had to implement a web browser extension and achieve thousands of installations to obtain the dataset they used in their research. In other words, the potential nanotargeting attack described in that work is limited to skilled users capable of obtaining or inferring the ad preferences list of the individual they are targeting. In contrast, this work aims to validate the feasibility of running nanotargeting campaigns at scale using non-PII items anyone can access. That is why we focused on the skills and locations within LinkedIn. Our work not only shows that running nanotargeting campaigns is feasible, but also demonstrates that it is within reach of low-skill attackers willing to implement it on LinkedIn.

This paper is available on arXiv under the CC BY-NC-ND 4.0 DEED license.