Wednesday, August 3, 2016

Preventing Spam Account Registration

(v0.2, updated 10/5/2016)

I am interested in the topic of preventing spam account registration, so I have collected, organized, and commented on some resources from the Web. This post discusses various approaches at a very high level; I plan to expand its depth and breadth in the future. Any comments are highly appreciated!

---------

We define a spam account of a Web service as an account that is not created for legitimate use of the service. Rather, its purposes could include:

  • inflating follower counts, view counts, etc.
  • manipulating votes
  • sending advertisements (e.g., pharmaceutical spam)
  • spreading malware
  • search engine optimization (SEO)

Within the complex cybercriminal ecosystem, some people specialize in creating spam accounts and selling them. Just search "buy facebook account" on Google and you will find sellers like:

  • https://buyaccs.com/en/
  • http://www.allpvastore.com/

Apparently, many activities supported by spam accounts can have a very negative impact on the Internet. So our goal is to detect and stop spam account registration, or to delete spam accounts as soon as possible. However, completely preventing spam account registration might be too difficult, so we can also adopt a softer goal: to increase the cost of spam account registration as much as possible.

Many approaches have been proposed to mitigate spam account registration. Below are some of them.


1. Preventing automatic account registration

Nothing is more convenient for an attacker than registering a large number of spam accounts automatically, either from a single machine or from a group of bots. To prevent such attacks, the defender needs to differentiate human visitors from bots. Some approaches are:

1.1 Captcha

The assumption is that visual recognition is an easy task for humans but a hard one for bots. However, visual recognition challenges can be cracked by bots [1]. Also, even if a challenge cannot be cracked by computers for now, it can still be solved by human crowdsourcing.

1.2 Registration randomization

The HTML code of the registration page, or even the registration process itself, can be randomized. This probably does not affect a human user, but it might disrupt bots that need to parse the page and follow the process. Some discussion of this idea can be found in [6].
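
As a toy illustration, here is a minimal sketch of per-session field-name randomization. The helper names, the session dict, and the field list are all hypothetical; a real implementation would hook into the web framework's form rendering.

import secrets

def make_registration_form(session):
    # Map each canonical field to a random name valid for this session only.
    mapping = {field: "f_" + secrets.token_hex(8)
               for field in ("username", "email", "password")}
    session["field_mapping"] = mapping
    # A scripted bot that posts hard-coded field names will miss these inputs.
    return "\n".join('<input name="%s" placeholder="%s">' % (rand, field)
                     for field, rand in mapping.items())

def parse_registration(session, form_data):
    # Translate the randomized names back to canonical ones on submission.
    mapping = session["field_mapping"]
    return {field: form_data.get(rand) for field, rand in mapping.items()}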


2. Requiring additional information during registration

This type of approach requires the attacker to provide additional information, such as an email address or a phone number. If obtaining such information costs the attacker more than verifying it costs the defender, the defender can use it to increase the attacker's cost.

2.1 Mobile number and/or email account

This type of information is easy to verify, typically via a confirmation message. It is still possible for the attacker to obtain a large number of mobile numbers or email accounts. For example, Gmail allows effectively unlimited alias variations of a single address (dots in the local part are ignored, and anything after a "+" is dropped). Some services also sell mobile numbers for receiving SMS. But the cost of registering spam accounts is increased.
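
Because of these alias rules, the defender may want to collapse variants into one canonical mailbox before counting accounts per email. A minimal sketch, covering only Gmail's documented rules (other providers differ):

def canonicalize_email(address):
    # Lowercase, then apply Gmail's alias rules so that
    # "J.ohn+spam1@Gmail.com" and "john@gmail.com" count as the same mailbox.
    local, _, domain = address.lower().partition("@")
    if domain in ("gmail.com", "googlemail.com"):
        local = local.split("+", 1)[0]   # drop any "+tag" suffix
        local = local.replace(".", "")   # Gmail ignores dots in the local part
    return local + "@" + domain

assert canonicalize_email("J.ohn+spam1@Gmail.com") == "john@gmail.com"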

In addition, the defender can occasionally ask users to reconfirm their mobile numbers or email accounts. This further increases the difficulty of using spam accounts, because the mobile numbers or emails used at registration time might have become unavailable by then (e.g., sold to someone else or blocked) [7].

2.2 Personal identification information

The defender can request a real name plus some personal identifiers (PIDs) such as a passport number or SSN, and then check whether the PIDs match the name, or whether the PIDs are internally consistent (a check used by the SSA website, for example). However, in many cases the defender needs the help of a third party, typically a government agency, to do the verification. In addition, some users might not like submitting such sensitive information.


3. IP rate-limiting and blocking

The number of account registrations allowed per IP address can be limited. In addition, if an abnormally large number of registrations comes from the same IP address, that address can be blocked [5].
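
A minimal sketch of a per-IP sliding-window limiter. It assumes a single-process service with in-memory state (a production deployment would keep the counters in shared storage such as Redis), and the limits are arbitrary:

import time
from collections import defaultdict, deque

WINDOW_SECONDS = 3600   # look at the last hour
MAX_REGISTRATIONS = 3   # arbitrary per-IP limit for illustration

recent = defaultdict(deque)  # ip -> timestamps of recent registrations

def allow_registration(ip):
    now = time.time()
    window = recent[ip]
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()              # forget attempts outside the window
    if len(window) >= MAX_REGISTRATIONS:
        return False                  # over the limit: block or challenge
    window.append(now)
    return True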

However, this approach might cause collateral damage to benign users because of:
  • Dynamic allocation of IP addresses [3]: a blocked IP address might later be assigned to a benign user.
  • Proxies and gateways: all users behind such a device are blocked from registration.
  • Tor [4]: many users share the same exit nodes, so blocking those nodes affects all of them.
In addition, this approach can be bypassed if the attacker uses proxies, VPNs, etc.


4. Detecting Spam Accounts

4.1 At Registration

The defender has different kinds of information about a new registration request:
  • Account information, including username, phone number, email, etc.
  • Behavior information, such as the time gaps between two consecutive operations.
  • Low-level information, such as IP address, user-agent, etc.

The defender can then train a detector that separates spam registrations from normal ones. Such a detector can be constructed from human insights. For example, [7] proposes using regular expressions to capture patterns in spam account names. Such patterns exist because many spam accounts are generated automatically from naming rules, so the defender can, in effect, reverse engineer those rules.
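
The sketch below shows the flavor of this idea. The patterns here are made up for illustration and are not the actual ones from [7]:

import re

# Hypothetical patterns a name-generation script might produce.
SUSPICIOUS_PATTERNS = [
    re.compile(r"^[a-z]+[0-9]{4,}$"),               # word + long digit tail, e.g. "maria58274"
    re.compile(r"^[a-z]{2,6}_[a-z]{2,6}_[0-9]+$"),  # word_word_digits
]

def looks_generated(username):
    return any(p.match(username.lower()) for p in SUSPICIOUS_PATTERNS)

print(looks_generated("maria58274"))  # True
print(looks_generated("dave"))        # False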

Of course, the defender can directly train a machine learning model if labeled training data are available. The defender can also apply unsupervised learning (e.g., clustering) to gain more insight into spam accounts.
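
A minimal supervised sketch with scikit-learn. The two toy registrations and the feature choices are made up purely for illustration; a real system would use far more data and richer features.

from sklearn.ensemble import RandomForestClassifier

def featurize(reg):
    return [
        len(reg["username"]),
        sum(c.isdigit() for c in reg["username"]),
        reg["seconds_on_form"],   # bots often submit the form almost instantly
    ]

# Toy labeled history: 1 = spam, 0 = legitimate.
history = [
    {"username": "maria58274", "seconds_on_form": 2,  "is_spam": 1},
    {"username": "jsmith",     "seconds_on_form": 95, "is_spam": 0},
]
X = [featurize(r) for r in history]
y = [r["is_spam"] for r in history]

clf = RandomForestClassifier(n_estimators=100).fit(X, y)

new_reg = {"username": "anna90311", "seconds_on_form": 1}
print(clf.predict_proba([featurize(new_reg)])[0][1])  # estimated spam probability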

4.2 After Registration

After an account is registered, the service will have more information to differentiate spam accounts from normal accounts. Example signals are:

  • Activities done after registration.
  • Characteristics of the friendship network.

These signals are likely to be helpful for a detector. However, at the time of detection, the spam account might already have been used in malicious activities.
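
As one toy example of a graph signal: spam accounts often follow many users while few follow them back. A sketch using networkx, where the graph and the interpretation thresholds are fabricated for illustration:

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("spam1", "celeb"), ("spam1", "brand"), ("spam1", "user_a"),  # spam1 follows 3
    ("user_a", "user_b"), ("user_b", "user_a"),                   # mutual friends
])

def follow_back_ratio(graph, node):
    # Followers divided by followees; near zero is suspicious for a new account.
    out_deg = graph.out_degree(node)
    return graph.in_degree(node) / out_deg if out_deg else 1.0

print(follow_back_ratio(G, "spam1"))   # 0.0 -> nobody follows back
print(follow_back_ratio(G, "user_a"))  # 2.0 -> followed by spam1 and user_b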

4.3 Challenges
  • For a service with a large number of users, even a tiny false positive rate is unacceptable: at 0.1%, a service with 100 million users would wrongly flag about 100,000 legitimate accounts.
  • The model might become less accurate over time due to concept drift.
  • The attacker might deliberately change their behavior to evade the learning algorithm, or even poison the model. Several reports have shown that many machine learning algorithms are not robust in adversarial scenarios.


5. Post-registration tracking

The service can store cookies on the client side. Then, if the same cookie appears across multiple accounts, there is a possibility that the client registered spam accounts. The service can consider using very persistent cookies, such as evercookie, which works even across browsers.
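
A minimal sketch of the bookkeeping side, with an in-memory dict standing in for a real database and an arbitrary review threshold:

import secrets
from collections import defaultdict

accounts_by_client = defaultdict(set)  # client_id -> account names seen

def get_or_set_client_id(cookies):
    # Reuse the identifier if the client already carries one; otherwise mint one.
    if "client_id" not in cookies:
        cookies["client_id"] = secrets.token_hex(16)
    return cookies["client_id"]

def record_registration(cookies, account_name, threshold=3):
    client_id = get_or_set_client_id(cookies)
    accounts_by_client[client_id].add(account_name)
    # Many accounts from one client is a signal worth manual review.
    return len(accounts_by_client[client_id]) >= threshold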


6. Collaborating with other services

6.1 OAuth

Why not offload this hard problem to services that already invest heavily in preventing spam account registration, such as Google and Facebook, by letting users sign in with those accounts [2]?
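
A minimal sketch of the "sign in with Google" authorization-code flow, using the third-party requests library. The endpoint URLs reflect Google's documentation at the time of writing; the client credentials and redirect URI are placeholders for your own registered application:

import urllib.parse
import requests

CLIENT_ID = "your-client-id"          # placeholder
CLIENT_SECRET = "your-client-secret"  # placeholder
REDIRECT_URI = "https://example.com/oauth/callback"

# Step 1: send the user to the consent page.
auth_url = "https://accounts.google.com/o/oauth2/v2/auth?" + urllib.parse.urlencode({
    "client_id": CLIENT_ID,
    "redirect_uri": REDIRECT_URI,
    "response_type": "code",
    "scope": "openid email",
})

# Step 2: Google redirects back with ?code=...; exchange it for tokens that
# identify an account Google has already vetted.
def exchange_code(code):
    resp = requests.post("https://oauth2.googleapis.com/token", data={
        "code": code,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,
        "redirect_uri": REDIRECT_URI,
        "grant_type": "authorization_code",
    })
    return resp.json()  # contains id_token / access_token on success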

6.2 Threat database

For example, the service can check an IP address against known IP blacklists.
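
Most DNS-based blacklists share one query convention: reverse the IPv4 octets and resolve them under the list's zone; any answer means the address is listed. A minimal sketch (the Spamhaus zone is one public example, and its terms of use apply):

import socket

def is_listed(ip, zone="zen.spamhaus.org"):
    query = ".".join(reversed(ip.split("."))) + "." + zone
    try:
        socket.gethostbyname(query)   # any A record means the IP is listed
        return True
    except socket.gaierror:
        return False                  # NXDOMAIN: not on this list

print(is_listed("127.0.0.2"))  # the standard DNSBL test entry; should print True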


References:

[1] Snapchat's Latest Security Feature Defeated in 30 Minutes
http://www.infosecurity-magazine.com/news/snapchats-latest-security-feature-defeated-in-30/

[2] http://stackoverflow.com/questions/170152/prevent-users-from-starting-multiple-accounts

[3] https://www.quora.com/How-does-an-ISP-assign-IP-addresses-to-home-users

[4] http://www.computerworlduk.com/tutorial/security/tor-enterprise-2016-blocking-malware-darknet-use-rogue-nodes-3633907/

[5] https://en.wikipedia.org/wiki/IP_address_blocking

[6] ShapeShifter: The emperor’s new web security technology | The Good, The Bad and the Insecure
https://blog.securitee.org/?p=309

[7] Thomas, Kurt, et al. "Trafficking Fraudulent Accounts: The Role of the Underground Market in Twitter Spam and Abuse." USENIX Security 2013

Tuesday, November 10, 2015

Reflections on a Data Scientist Job

(v0.1 - 2015/11/10)

I've been working as a data scientist intern for several months now. Here, I want to summarize my experience a little and list several things that I found important. Hopefully, I will achieve the goals listed below soon :)

I think there are several kinds of tasks for data scientists: analysis, research, building internal data products, and building external data products. The goal of analysis is mainly to answer direct business and engineering questions; a data scientist can expect to constantly receive various questions from different teams. Research tackles more difficult and complex questions in order to provide deeper insights. Some of the analysis or research results can then be turned into internal data products, mostly internal web services, so that other members of the company can reuse the analysis for their own tasks. Finally, if the methodology is reliable and the results are good, we can release the data service as part of the company's product to the outside world!

In a large company, each person can specialize in a particular area, while in a small company the data scientist typically needs to be good at everything (full snack?). But I think that no matter the size of the company, the following goals are always important:

1. Code Quality. Many people think that data analysis code only needs to run once, so one can write it in an ad hoc way. But in my experience this is not true: the code will be reused, and the project will become more and more complex. So all the usual software engineering wisdom needs to be applied to data code as well. Some extra investment in the beginning will save a significant amount of work later on.

2. Use Established Tools. One way to produce high-quality code and save time is to use established tools. This includes both development tools (e.g., IPython) and data science libraries (e.g., pandas).

3. Communication. Before working on a problem, it is important to clearly understand its purpose and the desired outcome. A lot of the time, the requester might not fully understand the complexity of the data, so some simple measures, such as the mean, might not work as expected. After the job is done, it is also important to convey the result to other people. The write-up, whether an email or slides, needs to be accurate and concise.

4. Feedback. Actively seek feedback from different kinds of people. Engineers, sales, customer managers, etc. all give very good and diverse comments, which can help you improve your workflow and get new ideas. If you are working on a data product, create an early demo and seek feedback early.

5. Learning. There are several aspects to learning. First, one should learn what's going on inside the company in order to actively find new opportunities to apply data analysis; some folks might not realize that a data scientist could help with their tasks. Second, one should keep improving one's skills, including statistics, machine learning, etc. Third, one needs to search for work related to the current project. Some data tasks are unique, but many others are similar, so learning from other people's experience can significantly speed up one's own projects.

6. Planning. Apparently, there are always many questions waiting to be answered, just as there are many development tickets waiting to be addressed. My experience is that it is usually impossible to finish all of them, so one needs to prioritize. Also, at the beginning of a project, always start with a simple solution. A "bad habit" of people from academia is that they tend to make things overly complex... Complexity is sometimes required to get a paper published, but it is usually not your friend in a company.

I did not talk about big data, because the datasets I focus on are rather small. But again, no matter the size of the data, these goals always apply. In addition, the trend in technology is to hide more of the lower-level details so that people can put more effort into building the tower of technology higher. That means that in the future, the interfaces for analyzing small and big data will probably be quite similar.


Some interesting data science articles and data blogs:

[1] The Top Mistakes Developers Make When Using Python for Big Data Analytics
https://www.airpair.com/python/posts/top-mistakes-python-big-data-analytics#3-mistake-2-not-tuning-for-performance

[2] Data Scientist: The Sexiest Job of the 21st Century
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

[3] Data Research from OkCupid
http://blog.okcupid.com


Saturday, September 12, 2015

Language and Intelligence




Part 1 International Students

I learned from my girlfriend and her friend that your language ability largely determines your perceived intelligence, particularly as an international student. Intelligence, in this sense, can be decomposed into two parts. The first part is whether you can understand some material and come up with good ideas, which I assume should not be difficult for many students. The second part is whether you can convey your thoughts to others. Unfortunately, many international students, including myself, struggle here, because our ability to speak the language (e.g., English) can sometimes significantly limit how we express our understanding and ideas. This can give others the impression that we are not intelligent. Or think about an extreme case where I am dropped into Moscow: without knowing a single word of Russian, I cannot express anything through language, so to the Russians I am equivalent to a complete idiot.

So it is very important to improve one's language in order to improve others' perception of one's intelligence.

Part 2 Mathematics

Many people say math is hard and that they cannot learn it well because they are not intelligent enough. But it might be that they have the intelligence to learn math, yet lack the language to read it. Using the example from Part 1: if I am in Moscow, I cannot book a hotel or visit a hospital, although nobody would consider these two tasks beyond a normal human being. I cannot do these things simply because I don't know Russian. Similarly, you cannot do math if you don't know its language. The language of math looks a little intimidating, but I think it should not be more difficult than a human language (this might be related to the Chomsky hierarchy [2]). So people who are afraid of learning math can view it as just learning a new language. Learning a new language requires constant input and practice, which also explains why we should keep learning things like math constantly.


References:

[1] The Figure. http://edl.ecml.at/Portals/33/images/EDL_Logo1.jpg

[2] https://en.wikipedia.org/wiki/Chomsky_hierarchy






Thursday, March 26, 2015

Risk Analysis of the Flight 9525



The crash of Germanwings Flight 9525 is a great tragedy. It is even sadder to remember that there have been multiple large-scale aviation accidents, including MH370, MH17 and TNA222, in the past two years. Many people, including myself, wish we had better technologies and policies to reduce the likelihood of such events or even prevent them completely.

In the ongoing investigation of the Flight 9525 accident, we have learned that the co-pilot locked the cockpit while the captain was outside, and then brought down the plane. The co-pilot's exact motive is unknown. Airlines do test the mental condition of pilots, but currently there is no report indicating that the co-pilot was abnormal [1].

In this article, I would like to discuss how we might be able to decrease the risk of such accidents. I was inspired by the following article by professor Juliette Kayyem [2]: "Was 9/11 safety precaution a flaw?" From the title, we can already see one main point of that article: the cockpit lock-up mechanism designed to prevent 9/11-style attacks becomes a problem when one of the pilots goes rogue. The author suggests having an emergency password so that no one can block access to the cockpit. I definitely think this is a good idea, but we should think deeper by considering the risks of different threats and how these risks are entangled.

There are many threats to an airplane: hijacking, mechanical errors, pilot errors and malicious pilots. Each threat has a risk value, which can be simply calculated as likelihood * impact. We will focus only on the likelihood part in this article. We want to reduce the likelihood of every threat below a certain threshold. However, this case clearly shows the difficulty: a mechanism that reduces the likelihood of one threat (hijacking, in this case) can increase the likelihood of another (a malicious pilot). Here, the cockpit lock-up mechanism is problematic because it pushed the likelihood of the malicious-pilot threat above the threshold.

One might think the lock-up mechanism is necessary to defend against hijackers, and that we simply have to sacrifice on other fronts. But I don't think so. There are many other ways to reduce the likelihood of hijacking, such as security checkpoints, on-board security guards (which I saw on Chinese domestic flights several years ago), and background checks of passengers. These lines of defense are probably able to reduce the likelihood of hijacking to an acceptable level. On the other hand, we do not have reliable methods to prevent malicious pilots. As discussed above, mental tests were not useful in this case, at least. And due to the complexity of the job, we have to grant pilots substantial authority. Being able to unlock the cockpit from outside, therefore, becomes an important line of defense against malicious pilots. But unfortunately, this defense was turned off for Flight 9525...

Another idea is to consider self-flying airplanes. After all, we already have self-driving cars. At the very least, an airplane could become a remotely controlled drone in an emergency. This would have helped not only Flight 9525 but also cases where the pilots lose consciousness, such as Helios Airways Flight 522. But a self-flying system introduces new threats, such as software bugs or even vulnerabilities, which are major threats to all kinds of digital systems today. Should we trust humans or machines?



References:

[1] Lufthansa CEO: Germanwings copilot passed medical exams http://www.cnn.com/2015/03/26/europe/lufthansa-ceo-germanwings-crash/index.html

[2] http://www.cnn.com/2015/03/26/opinions/kayyem-germanwings-co-pilot/index.html


Sunday, February 15, 2015

The Path of Sergio Leone




(v0.1)

Sergio Leone is one of my favorite movie directors. One can certainly attribute his success to genius and hard work. However, I think it is also interesting to look at his path of making films, as we might be able to learn something from it.

He made roughly 10 films over 25 years (1959-1984) [1]. His first two films were rather bad according to their IMDb ratings, but they probably gave him enough experience to make a better one. So he made his third film, A Fistful of Dollars, in 1964. The movie was a huge success, but with one problem: he had basically plagiarized the story of Yojimbo by Akira Kurosawa. Personally, I am fine with this, because he made a great film after all, and he compensated Kurosawa. More importantly, I think this step might have been necessary for a young director like him, as he lacked the experience to write a good story. Such imitation is probably the fastest way to become a master.

Since the 1964 movie was a huge success, the wise move was to make sequels: they would hone his ability further and bring reputation and money more quickly, with very little risk. So we have For a Few Dollars More (1965) and The Good, the Bad and the Ugly (1966). By then, Leone was a mature director, and it was time to climb the highest mountain of his life. The previous three movies were all Westerns, but their stories were constrained by one another. He needed to break out of the trilogy to fully exploit his creativity while still utilizing his experience with Westerns. So in 1968 he directed Once Upon a Time in the West, one of the greatest movies of all time. This movie also made him one of the greatest directors.

Those four films had probably exhausted his creativity in Western settings, so his eye turned to Mexico's past, and he produced Once Upon a Time... the Revolution in 1971. It is also a great movie; by this time, Leone had already reached the top level, so it was impossible for him to make a low-quality film anymore.

Then, for the next 13 years, he stopped directing, probably because he was tired and needed some rest, and also because he was preparing his next big shot. That movie, which finally arrived in 1984, was set in a world completely different from all his previous films: Once Upon a Time in America, which follows the lives of several gangsters in New York City. I personally think this movie reaches the apex of filmmaking. The story, the acting, the scenes, the music... all are the best. He is a true master.

Then, of course, we would just expect one masterpiece after another from him until his death. However, the end of his life came rather soon, because his body could not keep up with his great mind. He died in 1989 while preparing Leningrad: The 900 Days.

I think he had a fantastic life, with invaluable contributions to humanity. I also think his path resembles those of many great minds in other fields, such as academia (just replace films with research publications in the text above). I hope we can all draw some inspiration from the paths of these forebears.


Additional remarks:

  • We should also emphasize the contribution of Ennio Morricone, who composed superb music for Leone's movies. Their lifelong collaboration is also worth remembering.

References:

[1] http://en.wikipedia.org/wiki/Sergio_Leone#Filmography

Saturday, January 31, 2015

Notes on the GHOST Bug

The recent GHOST bug discovered in glibc is a heap buffer overflow that could potentially lead to arbitrary code execution. I was interested in learning about this bug because I am working on heap buffer overflow defenses. So I read the post written by the Qualys Security Advisory team, which provides an excellent explanation of it [1]!

This article contains some of my notes from studying the bug. I hope they will be helpful to others; please feel free to leave comments.

(1) Why wasn't this bug detected earlier?

It has been said that this bug has existed since 2000, so an important question is why it wasn't detected earlier. The article written by Qualys indicates that they found it through a manual code review, so the code probably had not received enough eyeballs previously.

On the other hand, I think this bug is fairly easy to detect by fuzzing, because in this case both the test inputs and the oracle are very easy to create. In contrast, the Heartbleed vulnerability is probably not easy to find through fuzzing, because both the test inputs and the oracle are hard to build.

I wrote the following simple program, which can trigger the vulnerability. We can think of it as a very simple fuzzer.

https://github.com/movingname/Toys/blob/master/C/GHOST2.c

We can use AddressSanitizer as the oracle, so I compiled the program with clang + AddressSanitizer. When I ran it, AddressSanitizer indeed reported a heap buffer overflow.
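
Although my test program is in C, the triggering call can also be sketched in Python via ctypes. This is a rough illustration, assuming a Linux system with glibc; only an unpatched glibc will actually overflow the buffer, while a patched one should simply fail with ERANGE:

import ctypes

libc = ctypes.CDLL("libc.so.6")

BUFLEN = 1024
# The magic length from [1]: buflen - 16 - 2 * sizeof(char *) - 1,
# so the all-digit "hostname" just barely overruns the work buffer.
name = b"0" * (BUFLEN - 16 - 2 * ctypes.sizeof(ctypes.c_char_p) - 1)

hostent_out = ctypes.create_string_buffer(64)   # opaque storage for struct hostent
buf = ctypes.create_string_buffer(BUFLEN)       # the undersized work buffer
result = ctypes.c_void_p()
herrno = ctypes.c_int()

rc = libc.gethostbyname_r(name, hostent_out, buf, ctypes.c_size_t(BUFLEN),
                          ctypes.byref(result), ctypes.byref(herrno))
print("gethostbyname_r returned", rc)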


I guess one could do a round of fuzzing for all functions of this kind in common libraries. Maybe more bugs could be found?


(2) procmail exploit

The article [1] shows how we can exploit this bug in procmail using

/usr/bin/procmail 'VERBOSE=on' 'COMSAT=@...ython -c "print '0' * $((0x500-16*1-2*4-1-4))"` < /dev/null

However, this command has some omissions (the ... in the middle). Actually, one can run

/usr/bin/procmail 'VERBOSE=on' 'COMSAT=@'`python -c "print '0' * $((0x500-16*1-2*4-1-4))"` < /dev/null

to trigger the glibc detection.

In addition, the length of the input is important. Apparently it cannot be too small, but it also cannot be too large, because procmail will detect the overflow. There is only a tiny window that triggers the overflow.



References:

[1] http://www.openwall.com/lists/oss-security/2015/01/27/9