SafeGraph strives to provide high-quality data about the physical world, and that goal means we must take accuracy very seriously. But what does that actually mean? Accurate data is much discussed, but accuracy is difficult to measure and is often a moving target.
When someone tries to measure accuracy, they are ultimately trying to determine if the data available to them will support what they are trying to do. But some use cases value volume whereas others value specificity.
To help our users measure what is most meaningful for their situation, SafeGraph aims to be very transparent about which Places should ultimately be included in our data and to provide a framework for assessing the accuracy of our Places dataset along two axes:
- Recall: A measure of the quantity of the rows in our dataset
- Precision: A measure of the quality of the entries in our dataset
And here’s how they relate to each other:
First, we have to define what we mean by a place. Then, in our efforts to build the highest-recall and highest-precision Places product, we must consider the natural tradeoff between pursuing one over the other. The more data you have, the more likely you are to have high recall. However, the more data you have, the harder it is to ensure the precision of each of those entries. The converse is also true: it is much easier to ensure the accuracy of a small dataset, but will that small dataset still meaningfully represent reality? SafeGraph focuses on maintaining an appropriate balance as we grow our dataset. We only want to increase recall when we’re confident in our ability to interpret the new data and maintain a high-precision product.
You can check our progress anytime at our Accuracy Metrics page.
The concept of accuracy inherently involves a form of comparison. In order to know that something is accurate, you must be aware of what the correct value should be. This introduces the concept of a truth set.
A truth set, in the context of places or POI data comparisons, refers to a reference dataset that is assumed to be completely accurate and is used as a benchmark to validate the quality of another dataset. The truth set can be viewed as the authoritative record against which other datasets are measured.
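To make the comparison concrete, here is a minimal sketch of how precision and recall could be computed against a truth set. It uses the standard set-based definitions and assumes that places in both datasets have already been matched to shared identifiers. The function name `precision_recall` and the example IDs are hypothetical, and this is not a description of SafeGraph's actual methodology.

```python
def precision_recall(candidate_ids, truth_ids):
    """Compare a candidate dataset against a truth set of place IDs.

    Assumption: both inputs use the same (hypothetical) identifiers, so a
    shared ID means the same real-world place.
    """
    candidate = set(candidate_ids)
    truth = set(truth_ids)
    matched = candidate & truth  # entries confirmed by the truth set

    # Precision: share of candidate entries that the truth set confirms.
    precision = len(matched) / len(candidate) if candidate else 0.0
    # Recall: share of truth-set places that the candidate dataset contains.
    recall = len(matched) / len(truth) if truth else 0.0
    return precision, recall


# Illustrative IDs only; these are not real place identifiers.
truth_set = {"cafe-1", "cafe-2", "bank-1", "gym-1"}
dataset = {"cafe-1", "cafe-2", "bank-1", "bank-9", "shop-3"}

precision, recall = precision_recall(dataset, truth_set)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints: precision=0.60 recall=0.75
```

In this toy example, adding more rows to `dataset` can only raise recall if the new rows appear in the truth set, while every unconfirmed row lowers precision, which is the tradeoff described earlier. Note that an entry missing from the truth set is counted as wrong here, so the result is only as reliable as the truth set itself.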
While truth sets are extremely valuable, it's important to keep in mind that no truth set can ever be perfectly complete nor perpetually accurate. There is no one golden source that is applicable in all scenarios. Places are subject to change; they open, they close, they move. Sources are often specific to certain regions or types of places. Therefore, regular updates and continuous verification are crucial to maintaining the accuracy and relevance of any truth sets.
SafeGraph leverages or maintains multiple types of truth sets to assist in measuring and improving accuracy. Sometimes it might be a government source. Sometimes it might be Google, the well-known industry standard. In many circumstances, we compile our own truth sets via aggregation and manual verification to test our work. You can read more about our Accuracy Metric Methodologies here.