It’s unusual to hear cats mentioned in a presentation about machine learning. But the two actually have more in common than you would think.
Ismail Elouafiq, a Data Scientist at SVT, has drawn an ingenious association between machine learning systems and cats. (Disclaimer: Ismail’s presentation and this article contain mentions of cats, so if you are allergic or just don’t like cats, you’ve been warned.)
Similarities between machine learning and cats
“Imagine you are working in a cat hospital,” starts Ismail, and you have admitted 132 cats that fell out of windows. After analysing these unpleasant events, it seems that the higher the building, the higher the probability of the cat surviving. 21 out of 22 cats survived a fall from a 7-story building, which means that about 95% of those cats actually survived, states Ismail.
Of course, cats are famous for landing on their feet and having 9 lives to spare, but statistically speaking there’s something fishy in the picture: the dataset itself, clarifies Ismail. The dataset of surviving cats is actually the dataset of cats that were admitted to the hospital, and a cat that doesn’t survive the fall is not taken to a hospital. The same goes for cats that fell from lower stories and didn’t sustain any injuries. Injured cats are a sad example, but they illustrate a well-known problem in statistics called a convenience sample: a sample made up of whatever cases happen to be easiest to collect rather than chosen to represent the whole population, which makes it highly biased.
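To make the hospital effect concrete, here is a minimal Python sketch. Every count in it is invented for illustration (none of these figures come from Ismail’s talk or from the hospital data), and the outcome categories are an assumption about how falls could be split up. It compares the survival rate across all falls with the rate seen among admitted cats only.

```python
# Hypothetical outcomes for 1,000 falls. Every number here is invented.
FALLS = {
    "died_on_impact": 300,    # never taken to the hospital
    "injured_survived": 500,  # admitted, later recovered
    "injured_died": 50,       # admitted, died in care
    "unhurt": 150,            # walked away, never admitted
}


def survival_rate(counts, survived_keys):
    """Share of cats in `counts` whose outcome is listed in `survived_keys`."""
    total = sum(counts.values())
    survived = sum(counts[key] for key in survived_keys)
    return survived / total


# What actually happened across all falls.
true_rate = survival_rate(FALLS, ["injured_survived", "unhurt"])

# What the hospital sees: only the cats that were admitted.
admitted = {key: FALLS[key] for key in ("injured_survived", "injured_died")}
hospital_rate = survival_rate(admitted, ["injured_survived"])

print(f"survival rate over all falls:      {true_rate:.2f}")
print(f"survival rate in hospital records: {hospital_rate:.2f}")
```

With these made-up counts the hospital records look far more optimistic than the full picture, simply because the cats that died on impact never show up in them.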
Some of you reading this may think this is obvious and there’s no need to discuss it. But Ismail states that this was a real problem found in an article published in the New York Times in 1989 called “On Landing Like a Cat – It is a Fact”. Of course, we can give them the benefit of the doubt: data science was not popular back then, and it’s hard to notice this kind of oversight if you’re not a statistician or a data scientist. In data science, there is a term for this kind of misleading pattern: an antipattern.
What are antipatterns
Antipatterns are design patterns that lead you to wrong conclusions if you follow them. When building machine learning systems, these design antipatterns are sequences of steps that you follow so automatically that you start ignoring the little details along the way and end up with a failing system, describes Ismail.
For those who actually do machine learning, Ismail gives two important pieces of advice:
- Take into account the bias of the data and use data models.
- Start simple and do error analysis.
Why more data is not always better
The first and main reason is lurking bias. Bias is the difference between the value we expect our estimate to take and the actual value we are trying to estimate, clarifies Ismail. When we collect more data we get more precision, but the bias remains. As a consequence, we get very precise, very wrong results: the random error is very low, but the bias persists.
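A small simulation shows the effect. This is only a sketch under stated assumptions: the true value, the size of the systematic offset and the noise level are all invented, and the measurements are drawn from a normal distribution for simplicity.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEAN = 50.0  # assumed true value we are trying to estimate
BIAS = 5.0        # assumed systematic offset caused by how the data is collected
NOISE_SD = 10.0   # assumed spread of individual measurements


def biased_sample(n):
    """Draw n noisy measurements that all share the same systematic offset."""
    return [random.gauss(TRUE_MEAN + BIAS, NOISE_SD) for _ in range(n)]


results = {}
for n in (100, 10_000, 100_000):
    sample = biased_sample(n)
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / n ** 0.5
    results[n] = (estimate, std_error)
    print(
        f"n={n:>7}: estimate={estimate:6.2f}  "
        f"std. error={std_error:.3f}  "
        f"offset from truth={estimate - TRUE_MEAN:5.2f}"
    )
```

The standard error shrinks as the sample grows, but the estimate keeps settling near 55 rather than 50. More data makes us more certain about the wrong number.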
The way out of this vicious circle is data models: an observation model and a process model. Ismail gives the example of a group of people who are given a survey to fill in. Of all the people who receive it, some usually won’t reply, so they are not represented in the survey results. Ismail claims that the people who did answer share some similar traits, which makes them a biased dataset. This problem can’t be solved by collecting more data in the form of survey replies. Instead, we should model the process of observation itself, that is, people receiving the survey and handing it back answered. This is what an observation model is. At the same time, there is a process model running because, as Ismail explains, between the moment people submit the survey and the moment we analyse the answers, the group we are trying to represent will have changed their opinions.
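One simple way to put an observation model to work on the survey example is inverse-probability weighting. This is a standard statistical technique, not something the talk is reported to have used, and the sketch below assumes we already know (or have modelled) how likely each kind of person is to reply. The two groups, their sizes, reply probabilities and opinions are all hypothetical.

```python
# Hypothetical survey population: two groups that differ both in how often
# they reply and in what they answer. All figures are invented.
GROUPS = [
    {"name": "engaged", "share": 0.3, "p_reply": 0.8, "yes_rate": 0.9},
    {"name": "unengaged", "share": 0.7, "p_reply": 0.2, "yes_rate": 0.4},
]


def true_yes_rate(groups):
    """Share of 'yes' answers if every recipient had replied."""
    return sum(g["share"] * g["yes_rate"] for g in groups)


def naive_yes_rate(groups):
    """Share of 'yes' among returned surveys, ignoring who failed to reply."""
    replies = sum(g["share"] * g["p_reply"] for g in groups)
    yes_replies = sum(g["share"] * g["p_reply"] * g["yes_rate"] for g in groups)
    return yes_replies / replies


def weighted_yes_rate(groups):
    """Weight each returned survey by 1 / p_reply to undo the filtering."""
    weighted_yes = 0.0
    total_weight = 0.0
    for g in groups:
        returned = g["share"] * g["p_reply"]  # recipients in this group who replied
        weight = 1.0 / g["p_reply"]
        weighted_yes += returned * weight * g["yes_rate"]
        total_weight += returned * weight
    return weighted_yes / total_weight


print(f"true 'yes' rate:     {true_yes_rate(GROUPS):.2f}")
print(f"naive estimate:      {naive_yes_rate(GROUPS):.2f}")
print(f"weighted estimate:   {weighted_yes_rate(GROUPS):.2f}")
```

The naive estimate is pulled towards the opinions of the people who like answering surveys. The weighted estimate recovers the true rate here only because the reply probabilities are assumed to be known exactly, which in practice is the hard part.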
Penguins and machine learning
As we’ve seen so far, explaining machine learning processes with animals is very useful, and this time we look at penguins. Let’s imagine that we had the luck to be admitted into the Penguin Research Society and our task is to estimate the penguin population size with the help of satellite images. We will do that by estimating the number of penguins in an image with deep neural networks and inverse reinforcement learning. But the first step is to label each image as either “has a penguin” or “doesn’t have a penguin”.
It looks like an easy task on the surface but, like an iceberg, most of the difficulty sits below the water. Penguins are smart and they hide underwater and under the snow. So if we don’t see a penguin on the satellite image, it can mean that the penguin is hiding. The image we see is only an observation. To be more precise, we need to establish that for “doesn’t have a penguin” there are two possibilities. The first is that it’s true and there really is no penguin. The second is that it’s false: there is a penguin, but it’s hiding underwater or under snow, or we can’t see it for whatever reason.
At this point, we bring in domain experts to help us define the probability of not seeing a penguin because there really is none versus not seeing one because it is swimming or otherwise hidden, in other words how often a penguin is present when we predicted that there’s no penguin. This way we can come up with an observation model that gives us an explicit representation of the bias and some control over how biased our data is.
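The penguin observation model can be sketched with Bayes’ rule. The code below is an illustration under assumptions, not the model from the talk: the prior probability that a penguin is present and the probability that a present penguin is visible are hypothetical numbers of the kind domain experts might supply, and the function names are made up. It also assumes an empty image never shows a penguin.

```python
def p_penguin_given_not_seen(p_present, p_seen_if_present):
    """P(penguin present | no penguin visible), by Bayes' rule.

    Assumes an empty image never shows a penguin (no false sightings).
    """
    not_seen_and_present = p_present * (1.0 - p_seen_if_present)
    not_seen_and_absent = 1.0 - p_present
    return not_seen_and_present / (not_seen_and_present + not_seen_and_absent)


def corrected_presence_rate(observed_rate, p_seen_if_present):
    """Scale the share of images showing a penguin by the detection probability."""
    return min(1.0, observed_rate / p_seen_if_present)


P_PRESENT = 0.4          # hypothetical prior from domain experts
P_SEEN_IF_PRESENT = 0.6  # hypothetical chance that a present penguin is visible

hidden = p_penguin_given_not_seen(P_PRESENT, P_SEEN_IF_PRESENT)
presence = corrected_presence_rate(0.24, P_SEEN_IF_PRESENT)

print(f"P(penguin | not seen) = {hidden:.3f}")
print(f"corrected presence rate = {presence:.2f}")
```

With these assumed numbers, roughly one in five “doesn’t have a penguin” images still hides a penguin, and an observed rate of 24% corresponds to penguins being present in about 40% of images.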
How to choose a machine learning model
“The nature of machine learning is arbitrary, and when dealing with arbitrary things, we are dealing with things that we don’t have any control over,” states Ismail.
One thing we can do, though, is ask ourselves one simple question: “If it were easy, what would it look like?”, and try to solve it ourselves. If we can answer this simple question, we will be able to choose a very simple model. Although we can’t control machine learning’s arbitrary nature, we can adopt a more scientific approach to the way we design machine learning models and data models. When we come to the point of choosing a model, Ismail advises keeping it simple and iterating on it, so we don’t end up with premature optimisation.
Dealing with skewed data
Skewed data is also frequent in deep learning, especially when it is used to detect tumours by classifying images as benign or malignant. If we have a patient and we are analysing their tumour image, we don’t want to tell them they are fine if they have a risk of developing cancer soon. This is an example of a false negative, explains Ismail: the prediction is negative, but the patient actually has cancer. To solve this issue, calculating accuracy and doing error analysis won’t be enough, and instead we rely on precision and recall. Precision is the number of true positives divided by the number of cases we predicted to be positive. Recall is the number of true positives divided by the number of cases that actually are positive. However, Ismail points out that there is a tradeoff between precision and recall: if you try to improve one, you usually give up some of the other. In a case like this, we need domain knowledge to specify what we need to focus on.
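The definitions and the tradeoff can be checked on a toy example. The ten labels and model scores below are invented for illustration (1 means malignant), and `metrics` is a hypothetical helper, not part of any library. Moving the decision threshold shifts the balance between precision and recall while accuracy barely reacts.

```python
def metrics(labels, scores, threshold):
    """Accuracy, precision and recall when score >= threshold means 'malignant'."""
    tp = fp = fn = tn = 0
    for is_malignant, score in zip(labels, scores):
        predicted_malignant = score >= threshold
        if predicted_malignant and is_malignant:
            tp += 1
        elif predicted_malignant:
            fp += 1
        elif is_malignant:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": (tp + tn) / len(labels),
        # By convention here, report 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical skewed dataset: 3 malignant scans out of 10.
LABELS = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
SCORES = [0.9, 0.6, 0.3, 0.7, 0.4, 0.2, 0.2, 0.1, 0.1, 0.05]

for threshold in (0.8, 0.5, 0.25):
    m = metrics(LABELS, SCORES, threshold)
    print(
        f"threshold={threshold:.2f}  accuracy={m['accuracy']:.2f}  "
        f"precision={m['precision']:.2f}  recall={m['recall']:.2f}"
    )

# A "model" that calls every scan benign still looks decent on accuracy.
always_benign = metrics(LABELS, SCORES, threshold=1.1)
print(f"always benign: accuracy={always_benign['accuracy']:.2f}  "
      f"recall={always_benign['recall']:.2f}")
```

On this toy data all three thresholds give the same accuracy, yet the strict one misses two of the three malignant cases and the lenient one catches them all at the cost of more false alarms. Which end to favour is exactly the domain decision described above.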
The key points from Ismail’s presentation that are worth remembering are:
- First ask yourself the simple question: “If it were easy, what would it look like?”
- Test everything: the data, your assumptions, and your models.
- Start with a simple model
- Although you can’t control the arbitrary nature of machine learning, you can adopt a more scientific approach with the way you design models.