If recall is a measure of quantity, then precision is a measure of quality. Precision in a places dataset can be analyzed on two fundamental levels: Row Precision and Column Precision. Row Precision measures whether a row, representing a unique place, should exist in the dataset at all. Column Precision focuses on the accuracy of the values stored within each row; it measures the quality of the individual attributes or data points related to the place. Maintaining high precision in both dimensions ensures that users have access to data that is not only extensive but also dependable.
For any given row in a dataset, row precision assesses whether that entry should actually exist: is it a viable place that we would expect to capture? It emphasizes the legitimacy of the points of interest (POIs) represented in the dataset, which is more important than ever in this digital age of consumer-inputted data.
At SafeGraph, we focus on what we call the Real Open Rate to assess our row precision. We sample different portions of our data and manually verify the percentage of rows that reflect a place that is both real and open (i.e. still a functioning business or visitable destination). You can read more in our Accuracy Metric Methodologies, and our Accuracy Metrics show our current Real Open Rate broken out along a few variables.
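As a rough illustration of the arithmetic behind a rate like this, the sketch below computes the share of manually reviewed rows that are both real and open. The `SampledRow` type, its field names, and the sample placekeys are hypothetical and exist only for this example; it is not a description of SafeGraph's internal tooling.

```python
from dataclasses import dataclass


@dataclass
class SampledRow:
    """One manually reviewed row (hypothetical structure for illustration)."""
    placekey: str
    is_real: bool  # reviewer confirmed the place exists
    is_open: bool  # reviewer confirmed the place is still operating


def real_open_rate(sample: list[SampledRow]) -> float:
    """Return the share of sampled rows that are both real and open."""
    if not sample:
        raise ValueError("sample must contain at least one row")
    passing = sum(1 for row in sample if row.is_real and row.is_open)
    return passing / len(sample)


# Made-up review results: two of the four rows are real and open.
sample = [
    SampledRow("aaa-111", is_real=True, is_open=True),
    SampledRow("bbb-222", is_real=True, is_open=False),
    SampledRow("ccc-333", is_real=True, is_open=True),
    SampledRow("ddd-444", is_real=False, is_open=False),
]
rate = real_open_rate(sample)
print(f"{rate:.0%}")  # prints "50%"
```

A row counts toward the rate only when both checks pass, so a real but closed place lowers the rate just as a nonexistent one does.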
Row Precision is ultimately about trusting that the dataset is filled with the right rows. By extension, it is really a test of the underlying data used to compile that dataset. Not all potential sources are created equal: many are very specialized, while others are very broad but wildly speculative. We rigorously vet the sources we use to understand their relative strengths and weaknesses, and based on that, we determine how and where they should contribute to our conflation engine. Ideally, even for the best sources, we never need to rely on that source alone and can cross-check the information against at least one additional source.
Strong sources provide reliable and verifiable information. Data from strong sources is given priority as we curate. Examples of strong sources include:
- First-party websites with up-to-date information.
- Government databases that are regularly updated.
- Claimed consumer review pages with recent activity.
Weak sources are less reliable and may provide inaccurate or outdated information. Data from weak sources needs to have additional levels of corroboration and verification before we trust it. Examples of weak sources include:
- Unverified consumer review pages.
- Databases that are not regularly updated.
Potential statuses for a Place
There are many reasons a row should exist in our data, and several reasons why a row might exist when it shouldn’t. To address this, we assign a status to each entry in our Places dataset. A place can have one of three main statuses, determined by the data available about that place as well as where that data was obtained.
A place is classified as "Open" if we can confirm one of the following:
- The existence of a first-party website.
- An address that can be successfully cross-referenced with at least two strong sources of signal.
- A reputable aggregation source with valid signs of recent activity.
These are the rows that we want to populate as much of the dataset as possible, if not the entire dataset.
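The three "Open" criteria listed above can be read as a simple any-of rule. The following is a minimal sketch under that reading; the `Evidence` type and its field names are hypothetical, and the real determination depends on source vetting that is not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Evidence:
    """Signals gathered for one place (hypothetical fields for illustration)."""
    has_first_party_website: bool = False
    strong_sources_matching_address: int = 0
    reputable_aggregator_recent_activity: bool = False


def qualifies_as_open(evidence: Evidence) -> bool:
    """Return True if at least one of the three listed criteria is met."""
    return (
        evidence.has_first_party_website
        or evidence.strong_sources_matching_address >= 2
        or evidence.reputable_aggregator_recent_activity
    )


print(qualifies_as_open(Evidence(has_first_party_website=True)))          # True
print(qualifies_as_open(Evidence(strong_sources_matching_address=1)))     # False
print(qualifies_as_open(Evidence(strong_sources_matching_address=2)))     # True
```

Note that a single strong source confirming the address is not enough on its own in this sketch, mirroring the "at least two strong sources" wording.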
Example 1: Real and Open
| 222-222@63j-xsr-grk | Cabela's | 1650 Gemini Pl | 43240 |
- Cabela’s Store Locator looks up to date and verifies that the POI exists
- Facebook page is very active with reviews and posts
- Mall directory shows POI
- Phone number (614) 702-2300 is active
Example 2: Real and Open
| 224-22g@5z5-3p7-hwk | Cali Rooter & Plumbing | 4470 W Sunset Blvd Ste 578 | 90027 |
- Yelp is claimed with 265 reviews
- Website is functional
- Google page has 42 reviews
- Call to phone number revealed they don’t have an actual location
A place is classified as "Closed" if multiple strong sources provide evidence of at least one of the following: the POI has been intentionally marked closed, it has no recent digital activity, its address or geographic information conflicts across strong sources, or its name conflicts across strong sources.
A place moving from Open to Closed is the most common reason a place stops contributing to our Real Open Rate. Removing closed POIs as quickly as possible is key to maintaining precision levels. We strive to assign a closed_on date to records when possible, but that level of detail is not always discernible.
Example 3: Closed
| zzw-222@63j-xsr-kj9 | T L C Columbus Laser Center | 8415 Pulsar Pl Ste 120 | 43240 |
- Website locates business over 12 miles away from listed address (an indication that it closed and moved)
- Phone call to listed number indicated they were no longer at listed location
- POI-type aggregators #1 and #2 locate them at a different address
A place is classified as "Unconfirmed" if the POI has no online presence beyond a single weak source, or if many weak sources are inconsistent with one another. Example: a POI from a movie that is actively being reviewed on social media. These POIs are removed from the SafeGraph dataset unless we can better corroborate their status.
A place can also become “unconfirmed” if it disappears from enough sources to drop it below our threshold of verification. In this case, the POI will be removed as well.
At SafeGraph, our approach to row precision is systematic, rigorous and responsive. We revisit all source data at a minimum once per month, often several times within a given month, to ensure we're capturing all recent changes affecting a POI.
We've established a precise algorithm for detecting the opening and closing of POIs, requiring several consecutive source refreshes to validate these statuses. This stringent process allows us to maintain accuracy, even when digital sources may temporarily fail.
During situations like COVID, where places may temporarily close, we're cautious about marking POIs as "closed" until permanent closure can be confirmed from a larger than usual number of consecutive source refreshes. This prevented a lot of churn in identifiers and preserved broader consistency among actual businesses.
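The idea of requiring several consecutive source refreshes before changing a status can be sketched as a simple streak counter. This is only an illustration of the general technique: the function name, the status strings, and the default threshold of 3 are assumptions, since the text does not state the actual number of refreshes required.

```python
from typing import Iterable, Optional


def confirmed_status(current: str, observations: Iterable[str], required: int = 3) -> str:
    """Replay per-refresh observations and switch status only after
    `required` consecutive refreshes agree on a different status.

    `required` is a hypothetical threshold chosen for illustration.
    """
    streak_status: Optional[str] = None
    streak = 0
    for observed in observations:
        if observed == current:
            # The source agrees with the confirmed status: reset any streak.
            streak_status, streak = None, 0
            continue
        if observed == streak_status:
            streak += 1
        else:
            streak_status, streak = observed, 1
        if streak >= required:
            current = observed
            streak_status, streak = None, 0
    return current


# A single "closed" blip followed by "open" does not flip the status.
print(confirmed_status("open", ["closed", "open", "closed", "closed"]))  # open
# Three consecutive "closed" refreshes do.
print(confirmed_status("open", ["closed", "closed", "closed"]))          # closed
# A stricter threshold, as during widespread temporary closures.
print(confirmed_status("open", ["closed"] * 4, required=5))              # open
```

Raising `required` trades slower detection of genuine closures for fewer false status changes, which is the tradeoff described for periods of temporary closure.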
Additionally, we actively hire contractors who live in the areas surrounding our POIs to compare the output of our current data against what exists in the real world at that moment. This creates "truth sets", which we use to estimate our POI coverage and attribute precision more broadly, and to fill apparent gaps or areas of relative underperformance.
It's worth noting that limiting duplicates also contributes to precision. We use a separate algorithm to detect when two records describe the same place by comparing geographic proximity and similarity across columns like location_name, street_address, categories, phone_number, website, and open_hours, as well as other unstructured metadata. Just like the tradeoffs in optimizing coverage for precision (and vice versa), there are natural tradeoffs between catching all duplicates and “over-merging” two places that appear similar but are actually distinct. We are proud to hold a patent related to deduplicating places data, and we hone our craft month in and month out. We will never be perfect, but with each passing month, our algorithm improves as we chew on a continuously increasing number of data points.
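To make the proximity-plus-similarity idea concrete, here is a deliberately simplified sketch that is not the patented algorithm. It combines a haversine distance check with a string-similarity check on location_name and an exact match on phone_number. The `latitude` and `longitude` keys, the thresholds, the coordinates, and the second business name are all assumptions made up for this example.

```python
import math
from difflib import SequenceMatcher


def haversine_m(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in meters between two points."""
    radius_m = 6_371_000
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_m * math.asin(math.sqrt(a))


def name_similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two names, ignoring case."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def likely_duplicates(a: dict, b: dict,
                      max_distance_m: float = 50.0,
                      min_name_similarity: float = 0.85) -> bool:
    """Flag two records as likely duplicates (illustrative thresholds)."""
    close = haversine_m(a["latitude"], a["longitude"],
                        b["latitude"], b["longitude"]) <= max_distance_m
    similar = name_similarity(a["location_name"], b["location_name"]) >= min_name_similarity
    same_phone = bool(a.get("phone_number")) and a.get("phone_number") == b.get("phone_number")
    # Proximity alone is not enough: neighbors in one building are distinct places.
    return close and (similar or same_phone)


# Illustrative records with made-up coordinates and phone number.
rec_a = {"location_name": "Cabela's", "latitude": 40.1450, "longitude": -82.9800,
         "phone_number": "+16145550100"}
rec_b = {"location_name": "Cabelas", "latitude": 40.1451, "longitude": -82.9801,
         "phone_number": None}
rec_c = {"location_name": "Gemini Coffee", "latitude": 40.1450, "longitude": -82.9800,
         "phone_number": None}

print(likely_duplicates(rec_a, rec_b))  # True: nearby with near-identical names
print(likely_duplicates(rec_a, rec_c))  # False: same spot, different business
```

Requiring a second signal beyond proximity is one way to limit over-merging; loosening either threshold catches more duplicates at the cost of merging more distinct places.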
Improving Row Precision is a constant process at SafeGraph. Our methods include:
- Model Improvements: We continually improve our models to better reflect reality. This involves adding additional verification sources and constantly adding training data.
- Routinely vetting sources: Our internal teams and external reviewers routinely vet our sources to ensure accuracy.
- Improving Open/Closed Determination: We're constantly refining what it means for a place to be “open” or “closed”. We do this by adding additional verification sources for new and closed businesses and improving the way we cross-check this type of information.
- Investments in Duplicate Detection: As we rapidly add sources, we need to also rapidly detect the overlap between them. And if you've tried, you know that's harder than it sounds.
- Being Super Responsive to Feedback: We value feedback from those using our data and strive to resolve any issues as swiftly as possible.
Column Precision measures the accuracy of the values stored within a row of a dataset. It helps in assessing the quality and reliability of the individual attributes or data points representing a place. You can refer to our Accuracy Metrics for our latest Coverage rate.
At SafeGraph, we adopt a systematic approach to ensuring column precision. This approach is guided by two key principles:
Focusing on columns where 100% fill rate is meaningful: We prioritize maintaining high precision in columns that are most critical to our dataset. This includes attributes such as:
- Open Hours
- Phone Numbers
- Domains and Websites
Programmatic Verification and Random Sampling: We employ a combination of programmatic verification and random sampling to ensure accuracy of our data.
For instance, we programmatically ensure that each website and URL listed in our dataset returns a successful response. Any URLs returning a 404 error are flagged for review.
For attributes that cannot be programmatically verified, we use random sampling to validate the data.
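A minimal sketch of these two checks might look like the following. The function names are hypothetical, and the status fetcher is passed in as a parameter so the flagging logic can be exercised without network access. The sketch assumes the fetcher follows redirects and returns the final HTTP status, and it flags any non-2xx result (including 404) as well as fetch failures.

```python
import random
import urllib.error
import urllib.request
from typing import Callable, Optional, Sequence


def http_status(url: str, timeout: float = 10.0) -> int:
    """Fetch the final HTTP status for a URL using a HEAD request."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404


def flag_broken_urls(urls: Sequence[str], fetch_status: Callable[[str], int]) -> list[str]:
    """Return URLs that fail to fetch or do not return a 2xx status."""
    flagged = []
    for url in urls:
        try:
            status = fetch_status(url)
        except Exception:
            flagged.append(url)  # DNS failure, timeout, and similar errors
            continue
        if not 200 <= status < 300:
            flagged.append(url)
    return flagged


def sample_for_manual_review(rows: Sequence, k: int, seed: Optional[int] = None) -> list:
    """Draw up to k rows at random for manual validation."""
    rng = random.Random(seed)
    return rng.sample(list(rows), min(k, len(rows)))


# Offline demonstration with a stubbed fetcher instead of real requests.
statuses = {"https://a.example/": 200, "https://b.example/missing": 404}
urls = ["https://a.example/", "https://b.example/missing", "https://c.example/"]
print(flag_broken_urls(urls, lambda url: statuses[url]))
# ['https://b.example/missing', 'https://c.example/']
```

In practice `http_status` would be passed as the `fetch_status` argument; some servers reject HEAD requests, so a production check would likely need a GET fallback.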
Improving Column Precision involves maintaining context when assessing any given column or attribute. It's not all about the “fill rates” (how often a column has a non-null value); we aim to achieve a balance between high precision and high fill rates.
The fill rate for a column only tells part of the story; it's also important to ensure the quality of the data. For example, if a global restaurant chain has the same main corporate telephone number listed 10,000 times for all of its locations, this might pass the fill rate test, but it doesn't provide meaningful or accurate data.
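The contrast between fill rate and precision can be shown with a small sketch that computes a column's fill rate and then flags values that are repeated across a suspiciously large share of rows. The function names, the 50% threshold, and the phone numbers are assumptions chosen for illustration.

```python
from collections import Counter
from typing import Optional, Sequence


def fill_rate(values: Sequence[Optional[str]]) -> float:
    """Share of rows with a non-null, non-empty value."""
    if not values:
        return 0.0
    filled = sum(1 for value in values if value not in (None, ""))
    return filled / len(values)


def overused_values(values: Sequence[Optional[str]], max_share: float = 0.5) -> dict:
    """Return repeated values that account for more than `max_share`
    of the filled rows (hypothetical threshold)."""
    filled = [value for value in values if value not in (None, "")]
    if not filled:
        return {}
    counts = Counter(filled)
    return {value: n for value, n in counts.items()
            if n > 1 and n / len(filled) > max_share}


# Made-up phone numbers for six locations of one chain.
phones = ["+18005550100"] * 4 + ["+16145550111", None]
print(f"{fill_rate(phones):.0%}")  # prints "83%"
print(overused_values(phones))     # {'+18005550100': 4}
```

Here the column looks healthy by fill rate alone, yet four of the five filled rows share one number, which is the pattern a corporate switchboard number would produce.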
This is why we have chosen to prioritize precision over fill rates where necessary. We strive to ensure you can count on the data we have populated, focusing on delivering valuable and meaningful data to our users.