What does it mean to "know" your data?

You’ve probably already heard that as a data scientist you really need to know the data you’re working with. And for good reason. One of the most important skills of a data scientist is the ability to understand a dataset before starting to work with it.

But just saying you should “know” your data is pretty vague. What does it mean to know your data? What does it entail exactly?

Let me break it down for you and give you some of the things that come to my mind when I tell someone they need to know their data. I'll group these into four categories: overall data knowledge, understanding of features, awareness of potential problems, and data format.

Overall data knowledge

This is the big picture. Looking at the data as a whole gives us better insight into what might go wrong and what to look out for during data exploration and cleaning. Not all of these will apply to every project, but some examples of things you should consider are:

How was the data collected?
When was the data collected?
How can you reach the data?
Is there more of this data where it came from?
What kind of problems might have occurred due to the collection or storage process? For example:

1. An automated script was collecting the data and on certain days it was down so there might be gaps in the data.

2. There was a character limit on the text field that held the news text content, which is why some data points were not completely collected.
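
Both of these collection problems can be checked for directly. Here is a minimal sketch using pandas, assuming a made-up table of daily news items. The column names (`collected_on`, `text`) and the 50-character limit are hypothetical and only there for illustration.

```python
import pandas as pd

# Hypothetical sample: one news item per day, with two days missing.
df = pd.DataFrame(
    {
        "collected_on": pd.to_datetime(
            ["2023-01-01", "2023-01-02", "2023-01-05", "2023-01-06"]
        ),
        "text": ["short item", "x" * 50, "another item", "y" * 50],
    }
)

# Problem 1: days on which the script should have run but produced no rows.
expected_days = pd.date_range(
    df["collected_on"].min(), df["collected_on"].max(), freq="D"
)
missing_days = expected_days.difference(pd.DatetimeIndex(df["collected_on"]))

# Problem 2: texts that reach the (assumed) field limit were probably cut off.
CHAR_LIMIT = 50  # assumption: the limit of the storage field
possibly_truncated = df[df["text"].str.len() >= CHAR_LIMIT]

print("Missing days:", list(missing_days.strftime("%Y-%m-%d")))
print("Possibly truncated rows:", len(possibly_truncated))
```

A text that sits exactly at the limit is not guaranteed to be truncated, so treat the second check as a list of rows to inspect rather than a verdict.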

Understanding the features

On this level, you get more granular and consider each feature separately. Looking at the features closely will give you valuable insights into the problem you’re working on. Many important decisions about the design of the solution are made on this level.

What does each feature mean?
What are the expected minimum and maximum values for each feature?
What is the distribution of each feature?
What unit are the features in (meters/inches/pounds/kilos)?
What is the timezone of the timestamp?
How do we expect each feature to affect the prediction?
How do the features correlate with each other? Is it what we expect? If not, why?
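
Several of these questions can be answered with a few lines of pandas. The sketch below assumes a small made-up table of body measurements. The column names, the plausible height range of 50 to 250 cm, and the choice of UTC as the timezone are all assumptions for the sake of the example, not something your data will tell you on its own.

```python
import pandas as pd

# Hypothetical sample: body measurements with a timezone-less timestamp.
df = pd.DataFrame(
    {
        "height_cm": [160.0, 172.0, 181.0, 168.0, 175.0],
        "weight_kg": [55.0, 70.0, 85.0, 63.0, 74.0],
        "measured_at": pd.to_datetime(
            [
                "2023-03-01 08:00",
                "2023-03-01 09:30",
                "2023-03-02 08:15",
                "2023-03-02 10:00",
                "2023-03-03 08:45",
            ]
        ),
    }
)
numeric = df[["height_cm", "weight_kg"]]

# Min, max and a rough picture of the distribution (quartiles, mean, std).
summary = numeric.describe()

# Rows outside the range we expect (assumed here: 50 to 250 cm).
out_of_range = df[~df["height_cm"].between(50, 250)]

# Pairwise correlations, to compare against what we expect to see.
correlations = numeric.corr()

# A timezone of None means the timestamps are naive and you have to find
# out which timezone they were recorded in. UTC is only an assumption here.
tz_before = df["measured_at"].dt.tz
df["measured_at"] = df["measured_at"].dt.tz_localize("UTC")

print(summary)
print(correlations)
print("Timezone before:", tz_before, "| after:", df["measured_at"].dt.tz)
```

Note that the code cannot tell you what a feature means or which unit it is in. For that you still need the documentation or the people who collected the data.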

Problems

There can be problems with the overall dataset or with individual features. It is crucial for the success of any project to be aware of them, address them and make sure they are contained.

Is the dataset balanced?
Is the data biased towards any group?
Is the data representative of the real-life situation you’re trying to model?
What percentage of the values are missing, and what is the best way to address them? Why?
What percentage of the values are outliers, and what is the best way to address them? Why?
Is there anything in any of the features that does not make sense? For example, weather forecast data where the temperature value is over 100°C on some occasions. Obviously, this is an incorrectly collected data point.
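
The percentage questions above are easy to put numbers on. Here is a rough sketch with pandas on a made-up weather table. The column names are hypothetical, and the 1.5 × IQR rule used for outliers is just one common convention that I picked for the example. What counts as an outlier depends on your problem.

```python
import pandas as pd

# Hypothetical sample: one missing temperature and one impossible reading.
df = pd.DataFrame(
    {
        "temperature_c": [12.0, 15.0, None, 14.0, 130.0, 13.0, 16.0, 11.0],
        "rained": [0, 0, 0, 1, 0, 0, 0, 1],
    }
)

# Percentage of missing values per feature.
missing_pct = df.isna().mean() * 100

# Outliers by the 1.5 * IQR rule (an assumed convention, not the only one).
temps = df["temperature_c"].dropna()
q1, q3 = temps.quantile(0.25), temps.quantile(0.75)
iqr = q3 - q1
outliers = temps[(temps < q1 - 1.5 * iqr) | (temps > q3 + 1.5 * iqr)]
outlier_pct = len(outliers) / len(temps) * 100

# Values that make no physical sense, like air temperatures over 100 °C.
implausible = df[df["temperature_c"] > 100]

# Share of each class in the target, to see how balanced the dataset is.
class_share = df["rained"].value_counts(normalize=True)

print(missing_pct)
print("Outliers:", outliers.tolist(), f"({outlier_pct:.1f}%)")
print("Implausible rows:", len(implausible))
print(class_share)
```

The numbers only answer the "how many" part. Deciding whether to drop, cap or impute these values, and why, is still up to you.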

Data Format

The format of the data is the overall shape the data comes in. It is common to change the frequency or granularity of the data. The main questions to address here are:

What is the format of the dataset? (e.g. every row is a data point per person per day)
What should be the format of the dataset? (e.g. you need it to be per family per week)
When changing the format of the dataset, how should you aggregate each feature? For example: if you’re turning daily data points into weekly ones, you can take the average of numerical values but how should you aggregate the categorical ones?
How do these aggregates affect the problem, do you need extra features? For example: after you take the average of numerical values, would you also need to include the standard deviation of that feature to not lose valuable information?
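
To make the aggregation questions concrete, here is a small sketch that turns daily rows into weekly ones with pandas. The data and column names are made up, and using the most frequent value for the categorical feature is only one possible choice, which I picked for illustration.

```python
import pandas as pd

# Hypothetical sample: 14 daily rows covering two full Monday-to-Sunday weeks.
days = pd.date_range("2023-01-02", periods=14, freq="D")
df = pd.DataFrame(
    {
        "steps": [4000, 5000, 6000, 5000, 4000, 6000, 5000,
                  1000, 9000, 1000, 9000, 5000, 5000, 5000],
        "activity": ["walk", "walk", "run", "walk", "rest", "walk", "run",
                     "run", "rest", "run", "rest", "run", "walk", "run"],
    },
    index=days,
)


def most_frequent(values: pd.Series) -> str:
    # One option for categorical features: keep the most common value.
    # On ties, mode() returns the values sorted, so the first one is kept.
    return values.mode().iloc[0]


# Group the daily rows into calendar weeks (weeks end on Sunday with "W").
weekly = df.groupby(pd.Grouper(freq="W")).agg(
    steps_mean=("steps", "mean"),
    steps_std=("steps", "std"),
    activity_mode=("activity", most_frequent),
)

print(weekly)
```

Both weeks in this sample have the same average step count, but the standard deviation is much larger in the second week. If you only kept the mean, the two weeks would look identical.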

Not all of these points will be applicable to all projects and for some, there might be other things you need to consider. But I think this list is a good starting point.

Data science is not an exact science and does not have rules to follow. It is very important to keep an open mind for possible pitfalls, problems and patterns when exploring and getting to know your data. And that is one of the charms of data science: the unique problems you face with every new dataset.

So next time you start a project, don’t just quickly clean your data and jump to modelling. Make sure that you can answer most of these questions. Go as granular as you can and understand how each feature works alone and how they work together. You'll see that it makes a big difference.