From problem to machine learning solution: formalizing the real world

One of my students in the Hands-on Data Science course asked me an intriguing question this week.

After completing the course, he reminded me of the project goal we had formulated at the beginning and wanted to know how the model we built helped solve the problem we had defined.

The problem definition was: in New York City, how can we assign taxis to regions fairly, so that each taxi driver has equal income potential?

That is, no taxi driver should be constantly assigned to a high-income area while another is sent to work in a low-income area day after day.

And the answer to his question was: it doesn't. At least not directly, and with good reason.

What we do in the course is build a model that predicts the expected income of a taxi driver in a given area of New York on a given day and hour. We engineer some additional features to improve performance, then train and tune the model to achieve high accuracy.
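To make that concrete, here is a minimal sketch of what such a model could look like. The file name, column names, and the choice of a gradient-boosting regressor are illustrative assumptions, not the exact course setup:

```python
# Minimal sketch of an hourly income-prediction model (illustrative,
# not the exact course pipeline). Assumes a hypothetical CSV with one
# row per (pickup_zone, date, hour) and a total_income column, where
# pickup_zone is a numeric TLC zone ID.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

df = pd.read_csv("nyc_taxi_hourly_income.csv")  # hypothetical file

# Features: where (zone) and when (hour, weekday derived from the date).
df["weekday"] = pd.to_datetime(df["date"]).dt.weekday
X = df[["pickup_zone", "hour", "weekday"]]
y = df["total_income"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = GradientBoostingRegressor()
model.fit(X_train, y_train)

print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))
```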

But estimating how much money a taxi driver makes in a certain area in a given hour does not, by itself, guarantee a fair sharing of the market. That decision layer still needs to be built on top of the model.
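As an illustration of what such a decision layer might look like, here is a sketch of one possible greedy heuristic (my own illustrative example, not something built in the course): each shift, the driver with the lowest cumulative income so far gets the zone with the highest predicted income.

```python
# Sketch of a fairness-aware decision layer on top of the model's
# predictions (an illustrative greedy heuristic, not a course method).

def assign_zones(predicted_income_by_zone, cumulative_income_by_driver):
    """Return a {driver: zone} assignment for one shift.

    predicted_income_by_zone: {zone_id: model's predicted income}
    cumulative_income_by_driver: {driver_id: income earned so far}
    """
    # Best-paying zones first, lowest-earning drivers first.
    zones = sorted(predicted_income_by_zone,
                   key=predicted_income_by_zone.get, reverse=True)
    drivers = sorted(cumulative_income_by_driver,
                     key=cumulative_income_by_driver.get)
    return dict(zip(drivers, zones))

# Driver "b" has earned less so far, so it gets the better zone.
predictions = {"midtown": 420.0, "outer_zone": 180.0}
earnings = {"a": 1500.0, "b": 900.0}
print(assign_zones(predictions, earnings))
# {'b': 'midtown', 'a': 'outer_zone'}
```

Over many shifts, this rotation tends to even out cumulative incomes. More importantly, the fairness logic lives outside the model, where we can read, question, and audit it.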

But why don’t we just build a model that makes this decision for us, too?

The answer is compartmentalization.

Data science and machine learning models do not provide answers to real-life problems directly, and they are not supposed to. They are tools to be used on the path to solving these problems: they provide inputs and insights that help humans make decisions, or build systems that make those decisions.

But, still, why not?

Well, most ML models are not explainable. They take an input and make decisions based on reasoning we cannot directly observe. If we make our whole solution one ML model, we will not know how or why it works. By compartmentalizing the different sections of the problem, we take back control, and of the most critical part too: the final decision process.

Of course, letting humans be the final deciders does not guarantee fairness either; how we use the results matters a lot. There was a famous case where a university used a model that predicts which students will eventually drop out of college to encourage those students to drop out before the school had to report its enrollment numbers to the government, in the hope of reporting better retention figures. [Are At-Risk Students Bunnies to Be Drowned?]

This is clearly unfair: just because a model trained on previous students’ data says an individual is likely to drop out does not mean that person actually will. We are all individuals, not proxies of our ethnicity, race, gender, or background, and we all deserve to be treated as such.

Another school used this same, seemingly evil kind of model in a rather good-hearted way: it offered extra support to the students the model deemed most likely to drop out, in the hope of reducing the number of students leaving education. [Artificial Intelligence in Higher Education]

As data scientists, we need to be aware of the limitations of ML models, specifically their strengths and weaknesses. We ought to know for which parts of a problem to employ ML techniques and where to rely on common sense and experience. Only then can we formalize the problems at hand in a suitable way.

This is one of the most important skills of a data scientist, if not the most important one.