Image by TopCoder
Over the last two weeks on this blog, I’ve talked about transparency and trust as part of the fabric of ethical considerations in artificial intelligence. Today I want to continue this series by looking at another key component: ground truth. Ground truth matters to the data science process because without it, we can’t make accurate deductions or evaluate our results.
Since I’m an aspiring data scientist and not a philosopher, we’ll be looking at the term ‘truth’ in a technical sense. The generally agreed-upon definition of ground truth, as explained nicely by TopCoder, is ‘factual data that has been observed or measured and can be analysed objectively.’ TopCoder goes on to explain that ground truth data is the opposite of data that is assumed, subject to someone’s opinion, or up for discussion. In other words, ground truth is another term for what you may have learned in science classes as empirical evidence.
Without good baseline data, models will be inaccurate and won’t help with the problem we’re trying to solve or the questions we’re trying to answer. As Scott Krig puts it in the paper Ground Truth Data, Content, Metrics and Analysis, ‘there is no way to improve what we cannot measure and do not understand.’ Simply put, having ground truth helps us produce a better data analysis. Clinton Bonner, in a LinkedIn article, also mentions that our ability to solve problems with data science depends on our ability to frame the problem once we have the ground truth.
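To make the ‘measure’ part concrete, here is a minimal sketch of how ground truth is typically used to score a model: compare each prediction with the observed label and count the matches. The function name `accuracy` and the animal labels are my own made-up illustration, not something taken from Krig’s paper or the TopCoder article.

```python
def accuracy(ground_truth, predictions):
    """Return the fraction of predictions that match the ground-truth labels."""
    if len(ground_truth) != len(predictions):
        raise ValueError("ground_truth and predictions must have the same length")
    if not ground_truth:
        raise ValueError("need at least one labelled example")
    # Count positions where the prediction equals the observed label.
    matches = sum(1 for truth, pred in zip(ground_truth, predictions) if truth == pred)
    return matches / len(ground_truth)


# Hypothetical data, invented purely for illustration.
observed = ["cat", "dog", "dog", "cat", "bird"]   # what was actually measured
predicted = ["cat", "dog", "cat", "cat", "bird"]  # what a model guessed

score = accuracy(observed, predicted)
print(f"accuracy: {score:.2f}")  # prints "accuracy: 0.80"
```

Without the `observed` list there would be nothing to compare the guesses against, which is really the point of the quote: no ground truth, no score, and no way to tell whether a change to the model made things better or worse.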
In addition to giving our models better input data, ground truth helps us avoid fitting the facts to our theories before we have the data. Sherlock Holmes thoughtfully said, ‘It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.’ A challenge when beginning a project is figuring out whether the data is a good fit for the problem at hand. For example, is the ground truth data detailed enough for an algorithm to find patterns in it? I think this is a challenge all practitioners probably face at one time or another. Having ground truth, though, is a key component of the entire data science analysis.