Wednesday, August 3, 2016

Preventing Spam Account Registration

(v0.2, updated 10/5/2016)

I am interested in the topic of preventing spam account registration, so I have collected, organized, and commented on some resources from the Web. This post discusses various approaches at a very high level. I plan to expand its depth and breadth in the future. Any comments are highly appreciated!

---------

We define a spam account of a Web service as an account that is not created for legitimate use of the service. Rather, its purposes could include:

  • increasing follower count, view count, etc.
  • manipulating voting
  • sending advertisements (e.g., pharmaceutical spam)
  • spreading malware
  • SEO

Within the complex cybercriminal ecosystem, some people specialize in creating spam accounts and selling them. Just search for "buy facebook account" on Google and you should find sellers like:

  • https://buyaccs.com/en/
  • http://www.allpvastore.com/

Apparently, many activities supported by spam accounts can have a very negative impact on the Internet. So our goal is to detect and stop spam account registration, or to delete spam accounts as soon as possible. However, completely preventing spam account registration might be too difficult, so we can also adopt a softer goal: increasing the cost of spam account registration as much as possible.

Many approaches have been proposed to mitigate spam account registration. Below are some of them.


1. Preventing automatic account registration

Nothing is more convenient for an attacker than registering a large number of spam accounts automatically, either from a single machine or from a group of bots. To prevent such attacks, the defender needs to differentiate human visitors from bot visitors. Some approaches are:

1.1 Captcha

The assumption is that visual recognition is an easy task for humans but a hard one for bots. However, visual recognition challenges can be cracked by bots [1]. And even if a challenge cannot be cracked by computers for now, it can still be solved by human crowdsourcing.

1.2 Registration randomization

The HTML code of the registration Web page, or even the registration process itself, can be randomized. This probably does not affect a human user, but it might disrupt bots that need to parse the page and follow the process. Some discussion of this idea can be found in [6].
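
As a minimal sketch of this idea, the server can derive per-session pseudorandom form field names from a session key. A bot hard-coded to POST `username=...` breaks, while the server can always recompute the mapping. (Field names and structure here are illustrative assumptions, not from [6].)

```python
import hmac
import hashlib
import secrets

CANONICAL_FIELDS = ["username", "email", "password"]

def randomized_field_names(session_key: bytes) -> dict:
    """Map each canonical field to a per-session pseudorandom name.

    HMAC makes the mapping deterministic per session key, so the
    server needs no extra storage to invert it later.
    """
    return {
        field: "f_" + hmac.new(session_key, field.encode(),
                               hashlib.sha256).hexdigest()[:12]
        for field in CANONICAL_FIELDS
    }

def decode_submission(session_key: bytes, form: dict) -> dict:
    """Translate a submitted form back to canonical field names."""
    reverse = {v: k for k, v in randomized_field_names(session_key).items()}
    return {reverse[name]: value for name, value in form.items()
            if name in reverse}

# Per-session usage: key = secrets.token_bytes(16), render the form
# with randomized_field_names(key), decode the POST on submission.
```

The same trick extends to randomizing field order or splitting the flow across pages; the key point is that only the server can cheaply invert the randomization.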


2. Requiring additional information during registration

This type of approach requires the attacker to provide additional information, such as an email account or a phone number. If obtaining such information costs the attacker more than verifying it costs the defender, the defender can use such requirements to raise the attacker's cost.

2.1 Mobile number and/or email account

This type of information is easy to verify, typically with a confirmation message. The attacker may still be able to obtain a large number of mobile numbers or email accounts. For example, Gmail ignores dots in the local part and anything after a plus sign, so a single address yields effectively unlimited variations. Some services also sell mobile numbers for receiving SMS. Still, the cost of registering spam accounts is increased.
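
A defender can defeat the Gmail-variation trick by deduplicating on a canonical form of the address. A small sketch (covering only Gmail's documented dot- and plus-aliasing; other providers have their own rules):

```python
def canonicalize_gmail(address: str) -> str:
    """Collapse Gmail alias variations to one canonical mailbox.

    Gmail delivers 'j.doe+spam@gmail.com' and 'jdoe@gmail.com' to the
    same inbox, so registrations should be deduplicated on the
    canonical form rather than the raw string.
    """
    local, _, domain = address.lower().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0].replace(".", "")
    return f"{local}@{domain}"
```

Addresses at other domains are left untouched, since dot-insensitivity is a Gmail-specific behavior.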

In addition, the defender can occasionally ask users to reconfirm their mobile numbers or email accounts. This further increases the difficulty of using spam accounts, because the mobile numbers or emails used at registration time might have become unavailable (e.g., sold to someone else or blocked) [7].

2.2 Personal identification information

The defender can request a real name plus some personal identifiers (PIDs) such as a passport number or SSN. The defender then checks whether the PIDs match the name, or whether the PIDs are internally consistent (an approach used, e.g., by the SSA website). However, in many cases the defender needs the help of a third party, typically a government agency, to do the verification. In addition, some users might not like submitting such sensitive information.
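
One internal-consistency check that needs no third party: many identifier schemes embed a check digit. As an illustration (the Luhn algorithm, used by payment card numbers and similar in spirit to the checksums in several national ID formats), a fabricated identifier that fails the checksum can be rejected immediately:

```python
def luhn_valid(digits: str) -> bool:
    """Luhn check-digit validation.

    Walking the digits right to left, every second digit is doubled
    (with 9 subtracted if the result exceeds 9); the identifier is
    well-formed when the total is divisible by 10.
    """
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0
```

This only filters malformed identifiers; confirming that a well-formed PID belongs to the claimed person still requires the third-party verification described above.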


3. IP rate-limiting and blocking

The number of account registrations allowed per IP address can be limited. In addition, if a large, abnormal number of registrations comes from the same IP address, that address can be blocked [5].
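
A per-IP limit is commonly implemented as a sliding window. A minimal in-memory sketch (the limit and window values are arbitrary examples; a production service would back this with a shared store):

```python
import time
from collections import defaultdict, deque

class IPRateLimiter:
    """Allow at most `limit` registrations per IP per `window` seconds."""

    def __init__(self, limit: int = 5, window: float = 3600.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent registrations

    def allow(self, ip: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:  # drop events outside the window
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

The same counter doubles as a blocking signal: an IP that hits the limit repeatedly can be escalated to a temporary or permanent block.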

However, this approach might cause collateral damage to benign users because of:
  • Dynamic allocation of IP addresses [3]: a blocked IP address might later be assigned to a benign user.
  • Proxies and gateways: all users behind such a device are blocked from registration.
  • Tor [4]: all users exiting through a blocked exit node are affected.
In addition, the attacker might bypass this approach by using proxies.


4. Detecting Spam Accounts

4.1  At Registration

The defender has different kinds of information about a new registration request:
  • Account information, including username, phone number, email etc.
  • Behavior information, such as the time gaps between two consecutive operations.
  • Low-level information, such as IP address, user-agent, etc.

The defender can then build a detector that separates spam account registrations from normal ones. Such a detector can be constructed from human insight. For example, [7] proposes using regular expressions to capture patterns in spam account names. Such patterns exist because many spam names are generated automatically from simple rules, so the defender can, in effect, reverse engineer those rules.
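
A tiny sketch of this regex approach. The patterns below are hypothetical examples of the kind of generation rules [7] reverse-engineered (e.g., a word plus a long digit suffix), not the actual patterns from that paper:

```python
import re

# Hypothetical generation rules: dictionary word + long digit suffix,
# or FirstnameLastname + digits. Real deployments mine these from
# batches of accounts confirmed to be merchant-generated.
SPAM_NAME_PATTERNS = [
    re.compile(r"^[a-z]+\d{4,}$"),                  # e.g. "walter83920"
    re.compile(r"^[A-Z][a-z]+[A-Z][a-z]+\d{2,}$"),  # e.g. "AliceSmith94"
]

def looks_generated(username: str) -> bool:
    """True if the username matches a known auto-generation pattern."""
    return any(p.match(username) for p in SPAM_NAME_PATTERNS)
```

Matching a pattern is weak evidence on its own; in practice it would be one feature among many, since plenty of benign users also pick name-plus-digits usernames.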

Of course, the defender can also train a machine learning model directly, if labeled training data are available. The defender can likewise apply unsupervised learning (e.g., clustering) to gain more insight into spam accounts.
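
Whatever the model, the first step is turning a registration event into a feature vector. A sketch under assumed inputs (the event keys `username`, `email`, `signup_seconds`, and `ip_count` are hypothetical names for the account, behavior, and low-level signals listed above):

```python
import math

def registration_features(event: dict) -> list:
    """Turn one registration event into a numeric feature vector.

    `event` is an assumed dict with keys 'username', 'email',
    'signup_seconds' (time spent filling the form), and 'ip_count'
    (prior registrations from the same IP). The vector can feed any
    supervised classifier or clustering algorithm.
    """
    name = event["username"]
    digits = sum(ch.isdigit() for ch in name)
    return [
        len(name),
        digits / max(len(name), 1),           # digit ratio: high for generated names
        float(event["email"].endswith(("gmail.com", "hotmail.com"))),
        math.log1p(event["signup_seconds"]),  # bots often submit almost instantly
        float(event["ip_count"]),
    ]
```

The specific features here are illustrative; the design point is that account, behavior, and low-level signals all reduce to columns of one vector.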

4.2 After Registration

After an account is registered, the service has more information with which to differentiate spam accounts from normal accounts. Example signals are:

  • Activities performed after registration.
  • Characteristics of the friendship network.

These signals are likely to be helpful for a detector. However, at the time of detection, the spam account might already have been used in malicious activities.

4.3 Challenges
  • For a service with a large number of users, even a tiny false positive rate (e.g., 0.1%) will cause huge collateral damage and is thus unacceptable.
  • The model might become less accurate over time due to concept drift.
  • Attackers might deliberately change their behavior to evade the learning algorithm, or even pollute the model. Several reports have shown that many machine learning algorithms are not robust in adversarial scenarios.
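
The first challenge is worth making concrete with back-of-the-envelope arithmetic (the user counts below are made-up round numbers):

```python
def expected_false_positives(users: int, spam_fraction: float, fpr: float) -> int:
    """Benign accounts wrongly flagged by a detector with the given
    false positive rate (fpr)."""
    benign = users * (1 - spam_fraction)
    return round(benign * fpr)

# With 100M users of which 5% are spam, a 0.1% false positive rate
# still flags 95,000 real users -- enough angry support tickets to
# make the detector unusable without a human-review or appeal path.
```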


5. Post-registration tracking

The service can store cookies on the client side. If the same cookie is later seen across multiple accounts, there is a good chance the client registered spam accounts. The service can consider using very persistent cookies, such as the evercookie, which persists even across browsers.
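
The server side of this reduces to counting distinct accounts per tracking cookie. A minimal sketch (the threshold of 3 is an arbitrary example):

```python
from collections import defaultdict

class CookieTracker:
    """Flag clients whose tracking cookie is tied to many accounts."""

    def __init__(self, max_accounts: int = 3):
        self.max_accounts = max_accounts
        self.accounts = defaultdict(set)  # cookie id -> account ids seen with it

    def observe(self, cookie_id: str, account_id: str) -> bool:
        """Record a (cookie, account) sighting; True means suspicious."""
        self.accounts[cookie_id].add(account_id)
        return len(self.accounts[cookie_id]) > self.max_accounts
```

As with IP blocking, the threshold needs care: families and shared computers legitimately produce a few accounts per device.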


6. Collaborating with other services

6.1 OAuth

Why not offload this nasty task to services that already invest heavily in preventing spam account registration, such as Google and Facebook [2]?
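
Concretely, the service redirects the visitor into an OAuth flow and only creates a local account after the provider vouches for one of its own. A sketch of the first step against Google's authorization endpoint (the client id and redirect URI are placeholders you would obtain from the Google API console):

```python
import secrets
from urllib.parse import urlencode

# Google's OAuth 2.0 authorization endpoint.
AUTH_ENDPOINT = "https://accounts.google.com/o/oauth2/v2/auth"

def google_signin_url(client_id: str, redirect_uri: str) -> tuple:
    """Build the URL that starts a 'Sign in with Google' flow.

    Returns (url, state); the state token must be checked on the
    callback to prevent CSRF. Google, not our service, then bears
    the burden of proving the visitor holds a real account.
    """
    state = secrets.token_urlsafe(16)
    params = {
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "response_type": "code",
        "scope": "openid email",
        "state": state,
    }
    return AUTH_ENDPOINT + "?" + urlencode(params), state
```

The trade-off: the service inherits the provider's spam-prevention quality, but also excludes users unwilling to link a third-party account.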

6.2 Threat database

For example, the service can check a registering IP address against known IP blacklists.
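
Such feeds typically publish CIDR ranges, so the check is a membership test against a set of networks. A sketch using the standard library (the ranges below are reserved documentation prefixes standing in for a real feed):

```python
import ipaddress

# Hypothetical local snapshot of a threat feed; real deployments
# would sync periodically from a blacklist provider or internal DB.
BLACKLISTED_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),   # documentation range, stand-in
    ipaddress.ip_network("198.51.100.0/24"),
]

def ip_is_blacklisted(ip: str) -> bool:
    """True if the address falls inside any blacklisted network."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLACKLISTED_NETWORKS)
```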


References:

[1] Snapchat's Latest Security Feature Defeated in 30 Minutes
http://www.infosecurity-magazine.com/news/snapchats-latest-security-feature-defeated-in-30/

[2] http://stackoverflow.com/questions/170152/prevent-users-from-starting-multiple-accounts

[3] https://www.quora.com/How-does-an-ISP-assign-IP-addresses-to-home-users

[4] http://www.computerworlduk.com/tutorial/security/tor-enterprise-2016-blocking-malware-darknet-use-rogue-nodes-3633907/

[5] https://en.wikipedia.org/wiki/IP_address_blocking

[6] ShapeShifter: The emperor’s new web security technology | The Good, The Bad and the Insecure
https://blog.securitee.org/?p=309

[7] Thomas, Kurt, et al. "Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse." USENIX Security 2013