Burning questions of data science

In every profession, there are disagreements between the members of the community. Most of the time, the quarrels happen either because both options are equally viable or because there is too little evidence to settle the matter one way or the other. And sometimes, people disagree simply because they have different preferences and the choice is very subjective.

Having an opinion on these disagreements is a neat cheat to look and feel like part of the community. Sooner or later, you will find yourself in the middle of these discussions anyway. I just want to give you a small head start with this article.

I will list some of the disagreements I’ve heard over the years in data science circles and share my personal opinion on them.

Python vs R

You might have heard this discussion before you even started studying data science. It is everywhere on the internet, everyone has something to say about it, and some people have very strong opinions on it.

If you think caring this much about which language you use is silly, then I'm with you. But it might just be one of the first things your colleagues ask you when you start working.

I have to admit that it is fun to go back and forth with your colleagues on all the pros and cons, but it might be doing more harm than good. I see many aspiring data scientists confused to the point of decision paralysis over this. They naturally want to make the right choice, but all the discussion on the internet is not helping them.

If I had to choose a side (which I don't, but I still will), I would prefer Python, mostly because I'm more comfortable with it, I have already used it a lot, and I can start getting results faster. That's also the language I recommend when I get asked. I find it intuitive and easy to learn. Moreover, there is a great community behind Python that will provide you with answers and support when you get stuck, not to mention all the amazing libraries that make your job much easier.

Though I have encountered many people who strongly prefer R and their reasons seem to be similar.

But hey, let’s look on the bright side. If both languages have serious die-hard fans, it might just mean that both are very good languages!

Matplotlib vs ggplot2

This is essentially an extension of the Python vs R discussion. Matplotlib is the go-to visualization tool when using Python, and ggplot2 is what people go for with R. People mostly criticize matplotlib because its plots don't look as polished out of the box. A friend of mine recently sent me a meme on this which I think sums up the whole discussion.

[Meme comparing matplotlib and ggplot2 plots — source: internet]

Let me show you examples of plots generated by each library. Of course, I agree that ggplot2 plots look much nicer without putting in extra effort. But at the same time, how beautiful do your plots really need to look when you're just analyzing away? Most of the time, as long as they show you what you need to see, it's alright.

[Example plots generated with matplotlib and ggplot2 — sources: Pythonspot and R-Graph-Gallery]
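In case the example images don't come through, here is a minimal sketch of the kind of default-styled matplotlib plot being compared; the data is made up purely for illustration.

```python
# A quick, default-styled matplotlib plot with made-up data,
# roughly the kind of chart being compared above.
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 10, 100)
y = np.sin(x) + np.random.normal(scale=0.1, size=x.size)

fig, ax = plt.subplots()
ax.plot(x, y, label="noisy sine")   # plain matplotlib defaults, no styling tweaks
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_title("A plain matplotlib line plot")
ax.legend()
plt.show()
```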


I find that matplotlib plots are just as functional and adaptable as their ggplot2 counterparts. Some R experts might disagree with me on this one.

One secret weapon of matplotlib (or, more generally, of Python) is the additional Seaborn library, which can make pretty kick-ass graphs/plots. Your move, R.

[Example plot from the seaborn.scatterplot documentation — source: Seaborn]
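To give you a taste, here is a minimal sketch following the usual pattern from the seaborn scatterplot docs; it uses the "tips" example dataset that ships with the library, nothing specific to my own work.

```python
# A minimal seaborn scatter plot on the built-in "tips" example dataset.
import matplotlib.pyplot as plt
import seaborn as sns

sns.set()                                   # switch on seaborn's nicer default styling
tips = sns.load_dataset("tips")             # small example dataset bundled with seaborn
sns.scatterplot(data=tips, x="total_bill", y="tip", hue="time")
plt.show()
```

A single sns.set() call is enough to get the polished default look that people usually hold up against matplotlib.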

Maintenance or no maintenance?

Another debate I’ve heard in data science circles is the question of whether a data scientist should maintain the models he/she developed. The dominant reaction to this question is either a very strong "Yes, of course!" or an equally strong "No, of course not!".

So what is model maintenance? Model maintenance is simply making sure the model still works as intended after it has been deployed and is being used by end users. This could involve:

  • Making sure the model is up to date with changes in real life
  • Making sure the model performance is still acceptable and is not deteriorating below a certain level (see the sketch after this list)
  • Making necessary updates when the input data format changes
  • Adapting to changes in data quality
  • And many other things...
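To make the performance point a bit more concrete, here is a hypothetical sketch of a scheduled check you might run against a deployed model. The function names, the metric, and the threshold are all placeholders I made up for illustration, assuming a scikit-learn-style classifier; a real setup would plug in whatever your deployment stack provides.

```python
# Hypothetical sketch of a scheduled model-maintenance check.
# load_production_model() and fetch_recent_labelled_data() are placeholders
# for whatever your own deployment setup provides.
from sklearn.metrics import roc_auc_score

PERFORMANCE_THRESHOLD = 0.75  # arbitrary example value


def check_model_health(load_production_model, fetch_recent_labelled_data):
    model = load_production_model()                    # assumed scikit-learn-style classifier
    X_recent, y_recent = fetch_recent_labelled_data()  # fresh, labelled production data

    # Score the deployed model on the recent data and compute a metric
    scores = model.predict_proba(X_recent)[:, 1]
    auc = roc_auc_score(y_recent, scores)

    if auc < PERFORMANCE_THRESHOLD:
        # In a real setup this would alert someone or trigger retraining
        print(f"Model performance degraded: AUC = {auc:.3f}")
    else:
        print(f"Model looks healthy: AUC = {auc:.3f}")
    return auc
```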

As you can imagine, it is a lot of work to keep an eye on models and make sure everything works. Part of the data science community thinks that data scientists are fully responsible for the models they build, because they know those models best and should be the ones keeping an eye on them. Others think that a data scientist’s job ends when he/she delivers a working model together with an explanation/documentation of it, and that there should be specifically trained professionals to do model maintenance.

My opinion on this is basically: it depends. If it’s a small company with limited resources, it is normal to expect the data scientist to be responsible for maintaining their models. But at a bigger company, it makes more sense to have someone else maintain the models, because it can easily become too much for a single data scientist to handle.

Data science is a dying profession vs data science is just starting to bloom

There is a lot of debate on whether the data science bubble is bursting. Some say that no one will want to become a data scientist in five years; others say that the technology is just getting started and is going to be the next big thing.

I think both views have merit. AI, ML, and data science are definitely hyped up. That's why I think it is reasonable to expect demand to decrease as time goes by and the hype dies down. But at the same time, there is still a lot we can achieve with it. I don't mean that in the sense that the algorithms will get better and more scalable and we will achieve bigger goals. Rather, I believe there are still industries that are only now starting to adopt ML techniques in their work. That’s why data science has room to grow and become a bigger part of a wider range of domains. I guess you can call it a horizontal expansion rather than a vertical one.

Specialization vs Generalization

Some data scientists think that it is better to become very good at one technique or discipline, such as NLP or image processing, while others prefer to stay data science generalists, training themselves to understand the needs of a project and to implement a variety of tools.

Many times, this debate stems from people claiming that one way is better than the other when it comes to data science. I think it is a very personal decision and there isn't really a good or bad option. 

For now, I prefer to be a generalist who can learn things fast when need be. I believe it makes me more efficient. I also quite like trying out new tools and approaches.

But if I ever needed to implement, say, an LSTM network, I would probably need to at least get input from an expert in that area, and most likely even work with them.

All in all, I believe the techniques required in data science can vary a lot. And at the end of the day, there is a need for both generalists and specialists.

Have you heard any other hot topics in data science? Something that people were creating long threads on Reddit over? Comment and let me know!