Tuesday, November 10, 2015

Reflections on a Data Scientist Job

(v0.1 - 2015/11/10)

I've been working as a data scientist intern for several months now. Here, I want to summarize my experience a bit and list several goals that I have found important. Hopefully, I will achieve these goals soon. :)

I think there are several kinds of tasks for data scientists, including analysis, research, building internal data products, and building external data products. The goal of analysis is mainly to answer direct business and engineering questions; a data scientist can expect to receive a constant stream of such questions from different teams. Research tackles more difficult and complex questions in order to provide deeper insights. Then, some of the analysis or research results can be turned into internal data products, mostly internal web services, so that other members of the company can reuse the analysis for their own tasks. Finally, if the methodology is reliable and the results are good, we can release the data service as part of the company's product to the outside world!
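As a toy illustration of the internal web service idea, here is a minimal sketch using Flask; the endpoint name, the metric, and the data are all made up for the example:

```python
# Minimal sketch of an internal data service using Flask.
# The endpoint, the metric, and the data are hypothetical.
from flask import Flask, jsonify
import pandas as pd

app = Flask(__name__)

@app.route("/metrics/daily_active_users")
def daily_active_users():
    # A real service would query a database; here we use toy data.
    events = pd.DataFrame({
        "user_id": [1, 2, 2, 3],
        "day": ["2015-11-09", "2015-11-09", "2015-11-10", "2015-11-10"],
    })
    dau = events.groupby("day")["user_id"].nunique()
    return jsonify(dau.to_dict())

if __name__ == "__main__":
    app.run(port=5000)
```

Other teams can then fetch the numbers with a plain HTTP request instead of rerunning the analysis themselves.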

In a large company, each person can specialize in a particular area, while in a small company the data scientist typically needs to be good at everything (full snack?). But I think that no matter the size of the company, the following goals are always important:

1. Code Quality. Many people think that data code only needs to run once, so it can be written in an ad hoc way. In my experience, this is not true: the code will be reused, and the project will become more and more complex. So all the usual software engineering wisdom needs to be applied to data code as well. A little extra investment in the beginning saves a significant amount of work later on, as in the sketch below.
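For example (a toy sketch; the function and the numbers are made up), instead of leaving a one-off computation inline in a notebook, factor it into a small function with a test:

```python
# Toy sketch: turn an ad hoc computation into a reusable,
# testable function instead of an inline notebook snippet.
def conversion_rate(num_signups, num_visits):
    """Fraction of visits that turn into signups."""
    if num_visits == 0:
        return 0.0
    return num_signups / float(num_visits)

def test_conversion_rate():
    assert conversion_rate(0, 0) == 0.0
    assert conversion_rate(5, 100) == 0.05
```

When the same metric is needed in the next project, it can be imported rather than copy-pasted.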

2. Use Established Tools. One way to produce high-quality code and save time is to use established tools. This includes both development tools (e.g., IPython) and data science libraries (e.g., pandas).
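For instance (toy data below), an aggregation that would otherwise need a hand-written loop over dictionaries is a single call in pandas:

```python
import pandas as pd

# Toy data: average purchase amount per country in one pandas call.
df = pd.DataFrame({
    "country": ["US", "US", "DE", "DE", "FR"],
    "amount":  [10.0, 20.0, 5.0, 15.0, 8.0],
})
print(df.groupby("country")["amount"].mean())
```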

3. Communication. Before working on a problem, it is important to clearly understand the purpose and the desired outcome. A lot of the time, the requester might not fully understand the complexity of the data, so simple measures, such as the mean, might not behave as expected. After the job is done, it is also important to convey the result to other people. The write-up, in the form of an email or slides, needs to be accurate and concise.
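Here is a toy example (the numbers are made up) of why the mean can mislead on skewed data:

```python
import pandas as pd

# One heavy user dominates the distribution, so the mean is
# inflated and the median is closer to the "typical" value.
session_minutes = pd.Series([1, 2, 2, 3, 3, 4, 500])
print(session_minutes.mean())    # ~73.6, pulled up by the outlier
print(session_minutes.median())  # 3.0, the typical user
```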

4. Feedback. Actively seek feedback from different kinds of people. Engineers, sales, customer managers, etc. all give good and diverse comments, which can help you improve the workflow and spark new ideas. If you are working on a data product, build an early demo and seek feedback early.

5. Learning. There are several aspects to learning. First, one should learn what is going on inside the company in order to actively find new opportunities for applying data analysis; some folks might not realize that a data scientist could help with their tasks. Second, one should keep improving one's skills, including statistics, machine learning, etc. Third, one should search for work related to the current project. Some data tasks are unique, but many others are similar, so learning from other people's experience can significantly speed up one's own projects.

6. Planning. There are always many questions waiting to be answered, just as there are always many development tickets waiting to be addressed. My experience is that it is usually impossible to finish all of them, so one needs to prioritize. Also, at the beginning of a project, always start with a simple solution (see the sketch below). A "bad habit" of people from academia is that they tend to make things overly complex... Complexity is sometimes required to get a paper published, but it is usually not a friend in a company.
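As a sketch of the "start simple" idea (toy data; scikit-learn's DummyClassifier simply predicts the most frequent class), fit a trivial baseline first, so you know what any fancier model has to beat:

```python
# Toy sketch: a trivial baseline gives a floor that any
# fancier model must beat before it earns its complexity.
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression

X = [[0.1], [0.4], [0.5], [0.9], [0.2], [0.8]]
y = [0, 0, 1, 1, 0, 1]

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
model = LogisticRegression().fit(X, y)
print("baseline accuracy:", baseline.score(X, y))
print("model accuracy:   ", model.score(X, y))
```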

I did not talk about big data, because the datasets I am focusing on are rather small. But again, no matter the size of the data, these goals always apply. In addition, the trend in technology is to hide more of the lower-level details so that people can put more effort into building the tower of technology higher. That means that in the future, the interfaces for analyzing small data and big data will be pretty much the same.
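To illustrate the point about similar interfaces (a sketch; the Spark line assumes a Spark DataFrame named spark_df with the same columns already exists), the same aggregation looks nearly identical in pandas and in Spark's DataFrame API:

```python
import pandas as pd

# The same aggregation, small-data style with pandas...
pdf = pd.DataFrame({"country": ["US", "DE", "US"],
                    "amount": [10.0, 5.0, 20.0]})
print(pdf.groupby("country")["amount"].mean())

# ...and big-data style with Spark's DataFrame API (assumed setup):
# spark_df.groupBy("country").mean("amount").show()
```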


Some interesting data science articles and data blogs:

[1] The Top Mistakes Developers Make When Using Python for Big Data Analytics
https://www.airpair.com/python/posts/top-mistakes-python-big-data-analytics#3-mistake-2-not-tuning-for-performance

[2] Data Scientist: The Sexiest Job of the 21st Century
https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century/

[3] Data Research from OkCupid
http://blog.okcupid.com

