Accuracy Metric Methodologies

We know measuring accuracy isn't easy. There are both theoretical and practical challenges to doing it effectively, and we want to be transparent about how we arrive at the numbers we highlight on our Accuracy Metrics page and throughout our docs site.


Why Sampling?

Because no single golden truth set exists for places data, and it isn't feasible to repeatedly compile a manual truth set for a large area in a timely manner. So we must use a sample of some kind. We want our sample to be both "random" (so that we aren't gaming any system) and representative (so that it is meaningful).

So we choose two zip codes each quarter - one urban and one rural - in an effort to provide meaningful examples from our dataset. We report metrics for all of the data compiled, and also break them out in a few meaningful ways. You can find those metrics here: Accuracy Metrics.

Defining Representative Zip Codes

Each quarter, we randomly pick one zip code that meets the urban criteria and one that meets the rural criteria below (a minimal sketch of this selection logic follows the criteria).

The urban zip code is deemed representative if it:

  • Is in the top 30% of zip codes ranked by population density
  • Contains at least 2,500 POIs in the SG dataset

The rural zip code is deemed representative if it:

  • Is in the bottom 50% of zip codes ranked by population density
  • Contains at least 300 POIs but no more than 2,500 POIs in the SG dataset
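
For illustration only, here is a minimal sketch of that quarterly selection logic. It assumes a pandas DataFrame with one row per zip code and illustrative column names (zip_code, pop_density, sg_poi_count); it is not our production pipeline.

```python
import random

import pandas as pd

def pick_representative_zips(zips: pd.DataFrame, seed=None) -> dict:
    """Randomly pick one urban and one rural zip code within the representative bounds."""
    rng = random.Random(seed)

    # Percentile rank by population density (1.0 = densest zip code).
    density_pct = zips["pop_density"].rank(pct=True)

    urban_pool = zips[
        (density_pct >= 0.70)              # top 30% by population density
        & (zips["sg_poi_count"] >= 2500)   # at least 2,500 SG POIs
    ]
    rural_pool = zips[
        (density_pct <= 0.50)                        # bottom 50% by population density
        & (zips["sg_poi_count"].between(300, 2500))  # 300 to 2,500 SG POIs, inclusive
    ]

    # Randomize the selection within the representative bounds.
    return {
        "urban": rng.choice(urban_pool["zip_code"].tolist()),
        "rural": rng.choice(rural_pool["zip_code"].tolist()),
    }
```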

Why cycle zip codes each quarter?

This approach provides a broad, representative sample over time, rather than focusing too much on any specific area. We randomize the selection within the above representative bounds.

We invite our users to submit zip codes for future release checks! Our goal is to provide the most accurate, reliable data possible - we're not looking to game the system, but to accurately communicate the integrity of our data.


How are we benchmarking?

To determine accuracy, you need a comparison point. A single dataset in a vacuum does not allow you to assess what should be in it and what shouldn't. Ideally, that benchmark would be a perfect truth set of all possible values, but that doesn't exist for places.

So we need a more readily available, practical benchmark. At the moment, for our purposes in the United States, there is no better-known benchmark than Google, which is routinely deemed the industry standard.


Real Open Rate

Our precision metric is the 'Real Open Rate'. This ratio expresses the proportion of entries in the SafeGraph dataset that are confirmed to be 'real' and 'open' (according to our definitions) through manual review, out of all SafeGraph entries in the sample. The higher the ratio, the more entries are verified. This is a measure of the quality and accuracy of the entries in the dataset.

Equation

Real and Open SafeGraph POI / Total SafeGraph POI

Approach

We engage a third party to review all of the SafeGraph places in the two representative zip codes and identify those that should and should not be present. They can leverage whatever means necessary to arrive at their assessment, which often includes web searches, social media reviews, URL validation, or oftentimes simply calling the place. The percentage of SafeGraph POIs that the reviewers confirm to be real and open, out of the total SafeGraph POIs reviewed, reflects how accurately the data represents reality.
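
As a rough illustration, the calculation reduces to a simple ratio over the reviewed records. The sketch below assumes the review output is a list of records with a boolean confirmed_real_and_open flag; the field names and placekey values are hypothetical.

```python
def real_open_rate(reviewed_sg_pois: list) -> float:
    """Share of reviewed SafeGraph POIs confirmed real and open by the third party."""
    total = len(reviewed_sg_pois)
    real_and_open = sum(1 for poi in reviewed_sg_pois if poi["confirmed_real_and_open"])
    return real_and_open / total if total else 0.0


# Illustrative records only, not actual review results.
sample = [
    {"placekey": "zzw-222@abc", "confirmed_real_and_open": True},
    {"placekey": "zzw-223@abc", "confirmed_real_and_open": True},
    {"placekey": "zzw-224@abc", "confirmed_real_and_open": False},
]
print(f"Real Open Rate: {real_open_rate(sample):.1%}")  # -> Real Open Rate: 66.7%
```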


Coverage Rate

For recall, we measure the 'Coverage Rate' of our dataset using the best-known industry benchmark in the US: Google. This ratio compares the total count of real and open SafeGraph POIs to the total count retrieved from Google (also reviewed to be real and open). The resulting percentage reflects the total coverage of "reality" (again using Google as a practical benchmark).

Equation

Real and Open SafeGraph POI / Real and Open Google POI

Approach

We engage a third party to review all of the SafeGraph places in the two representative zip codes and have them identify those that are both "real" and "open" using any means necessary (if that sounds familiar, it's because it is the exact same numerator as the Real Open Rate, so we can reuse the same reviewed data). We also have the external party review all of the places returned by Google in the same zip codes (derived via bounding box). They identify which entries are real and open, just as they do with the SafeGraph POIs, and then the counts are compared. Note that this doesn't directly account for entries that show up in one dataset but not the other, which is why we also show a coverage overlap analysis for context.
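
Under the same assumptions, the Coverage Rate is just the ratio of the two confirmed counts. The numbers in the usage example are illustrative, not actual results.

```python
def coverage_rate(real_open_sg_count: int, real_open_google_count: int) -> float:
    """Confirmed real-and-open SafeGraph POIs divided by confirmed real-and-open Google POIs."""
    if real_open_google_count == 0:
        return 0.0
    return real_open_sg_count / real_open_google_count


# Example with made-up counts: 1,840 confirmed SafeGraph POIs vs. 2,000 from Google.
print(f"Coverage Rate: {coverage_rate(1840, 2000):.1%}")  # -> Coverage Rate: 92.0%
```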


Category Aggregation

We engage a third party to review all of the Google places in the two representative zip codes and have them assign each place to a category bucket that aligns with the NAICS framework at the two-digit level (e.g. 72* for Accommodation and Food Services). Because Google only provides tags and does not provide a standardized category assignment, this is the best way to compare all places against the NAICS assignments available for SafeGraph in the same two representative zip codes. Where relevant, the third party can leverage the category metadata that Google provides to assign the appropriate category bucket.
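
As a sketch of how such a roll-up might look: SafeGraph POIs are assumed to carry a full NAICS code, while the reviewer's Google assignments are assumed to be stored as a two-digit bucket. All field names and the example code in the comment are illustrative.

```python
from collections import Counter

def naics2_bucket(naics_code: str) -> str:
    # Two-digit NAICS sector, e.g. "722511" -> "72" (Accommodation and Food Services).
    return naics_code[:2]

def bucket_counts(sg_pois: list, google_pois: list) -> tuple:
    """Count POIs per two-digit NAICS bucket for each provider in the sample zip codes."""
    sg = Counter(naics2_bucket(p["naics_code"]) for p in sg_pois)
    google = Counter(p["assigned_naics2"] for p in google_pois)
    return sg, google
```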


Coverage Overlap

This exhibit shows the output of a match run between the places found in the sample zip codes from each provider. The output is shown as the percentage of places that fall into each of three buckets:

  • SafeGraph Only
  • Both Providers (the "Overlap")
  • Google Only

This review is helpful for understanding how close the datasets are to representing the same reality. It is also good context when considering coverage rates, which can skew artificially high if the overlap between the two datasets is low.
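
A simplified sketch of how those bucket percentages could be computed from a match run, assuming matched places share a common identifier across providers; the variable names are assumptions for illustration.

```python
def overlap_breakdown(sg_match_ids: set, google_match_ids: set) -> dict:
    """Share of places that are SafeGraph-only, in both providers, or Google-only."""
    both = sg_match_ids & google_match_ids
    sg_only = sg_match_ids - google_match_ids
    google_only = google_match_ids - sg_match_ids
    total = len(both) + len(sg_only) + len(google_only)
    return {
        "SafeGraph Only": len(sg_only) / total,
        "Both Providers": len(both) / total,
        "Google Only": len(google_only) / total,
    }
```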