October-2020 Release Notes

Welcome to October! 🍂 Pour yourself a pumpkin spice latte and peruse the latest SafeGraph news (2020-09-27/1601204526 shipped 2020-10-07).

Highlights

  • placekey identifier for every record in SafeGraph Places 🔑
  • A massive update to our deduplication model courtesy of SafeGraph Machine Learning 👯‍♂️ :boom:
  • Big time brand additions 💯

Placekey

  • We are thrilled to announce the co-founding of the Placekey initiative alongside some influential industry leaders. The diverse group of partners supporting Placekey demonstrates the need for a common language across geospatial data. That's why Placekey's mission is to unlock geospatial data through a free, universal identifier for any physical place, so that the data pertaining to those places can be easily searched, shared, and joined to other places datasets.

  • Placekey is immediately available in the U.S. with plans to include other countries in the near term. 🌎 Learn more about placekey design in the Places Manual.

  • Join us at the Placekey launch event on October 7th to hear from industry leaders, learn about Placekey applications, and get started using Placekey with your own datasets.

  • As a founding member, we now include placekey as the first column in all SafeGraph Places products in the U.S., and it will eventually replace safegraph_place_id as the unique and persistent ID for each record in our dataset. Don't worry - we will give plenty of notice before phasing out safegraph_place_id :loudspeaker:
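
For pipelines that currently key on safegraph_place_id, one way to prepare for the transition is to build a crosswalk from the old ID to placekey while both columns are delivered. The following is only a minimal sketch, not SafeGraph tooling: the sample rows, the ID values, and the location_name column are made up for illustration, and only the placekey and safegraph_place_id column names come from the notes above.

```python
import csv
import io

# Hypothetical sample of a Places file. All values are invented placeholders;
# per the notes above, placekey is the first column.
SAMPLE_PLACES_CSV = """placekey,safegraph_place_id,location_name
pk-example-1,sg:0001,Example Cafe
pk-example-2,sg:0002,Example Books
"""

def build_crosswalk(places_file):
    """Map each safegraph_place_id to its placekey."""
    reader = csv.DictReader(places_file)
    return {row["safegraph_place_id"]: row["placekey"] for row in reader}

crosswalk = build_crosswalk(io.StringIO(SAMPLE_PLACES_CSV))

# Re-key a hypothetical internal table that still uses the old ID.
my_visits_by_sgpid = {"sg:0001": 120, "sg:0002": 45}
my_visits_by_placekey = {
    crosswalk[sgpid]: visits for sgpid, visits in my_visits_by_sgpid.items()
}
print(my_visits_by_placekey)
```

Keeping the crosswalk as a separate lookup means downstream tables can be re-keyed at their own pace before safegraph_place_id is phased out.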

Enhancements - Core Places and Brands

  • After months of logic refinements, training data updates, and tedious QA, we are delighted to introduce a new deduplication model! :tada:

  • Deduplication is a core competency for SafeGraph so that we can continue ingesting new POI sources with the assurance that we're not adding redundancy to the data. This model operates "behind the scenes" comparing millions of pieces of metadata to make determinations about whether two POIs are the same or distinct.

  • Last month SG Places had 5,901,528 points-of-interest. This month SG Places has 5,933,243 points-of-interest (net +31,715 places). The net change breaks down into +27,151 US places and +4,564 CA places.

  • We've added +83 brands including +8 Full-Service Restaurants 🍴 and +24 Canada-only brands 🍁
    New brands include:

    • Canada Post (canadapost.ca, SG_BRAND_56f0efdcdba9479b) with 0 US and 1,780 CA places.
    • Bank of Montreal (bmo.com, SG_BRAND_4f2e92d4217368f3) with 0 US and 880 CA places.
    • M&M Food Market (mmfoodmarket.com, SG_BRAND_621a88a676c25d0a) with 0 US and 622 CA places.
    • Kroger Fuel Center (kroger.com/fuel, SG_BRAND_f29bd2583c8336be), parent brand: Kroger (SG_BRAND_1f852a23da4b7250), with 989 US and 0 CA places.
    • Barnes and Noble College (bncollege.com, SG_BRAND_d1e609cfd201e1de), parent brand: Barnes and Noble (SG_BRAND_0031e43e4f12b969239801d340f7c141), with 443 US and 0 CA places.
    • and 78 more!

Bug Fixes and Known Issues - Core Places and Brands

  • We discovered a few brand count fluctuations as a result of updated sourcing and other metadata bugs. These corrections resulted in significant changes in the total number of POIs for each affected brand, but the new count is correct. For transparency, we'd like to list some of these corrections as examples in no particular order:

    • Southern States Cooperative (SG_BRAND_b171fe50c25853ab56e1f134afc569ac). Net POI count change: US: -35 CA: 0. Bug: Previously included dealers (e.g., Ace Hardware), which are now filtered out.
    • FNB Bank, N.A. (SG_BRAND_f541bcb6ec62de093416a7c8de510e84). Net POI count change: US: -133 CA: 0. Bug: Previously included ATM-only locations.
    • Best Buy (SG_BRAND_2c648ef84225e10f0499e7d255eacf71). Net POI count change: US: -64 CA: 0. Bug: Previously included child brand Pacific Sales & Magnolia Home Theater.
    • La Petite Academy (SG_BRAND_9fa908d38c44268e388fb1976738aed7). Net POI count change: US: -82 CA: 0. Bug: Previously included affiliates (e.g., Tutor Time).
    • True Value Hardware (SG_BRAND_0b1f746a0c413ffd). Net POI count change: US: -914 CA: 0. Bug: Previously included affiliates (e.g., Ben's Super Center).

Enhancements - Categories

  • We received positive feedback on our recent scope expansion into industrial POIs and have added roughly 2,500 small manufacturing sites this month. 🚧 🏗
  • These POIs are not "branded" and can be found under the "Other Miscellaneous Manufacturing" top_category description (naics_code = 3399). Reference the Places Manual for details on where to find all industrial POIs.
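
To pull out the new manufacturing sites described above, a filter along the following lines may help. This is a rough sketch under stated assumptions: the sample rows are invented, the brands and location_name column names are assumed rather than taken from these notes, and the prefix match on naics_code is a guess that delivered codes may carry more than four digits.

```python
import csv
import io

# Hypothetical rows. Only top_category and naics_code are named in the notes;
# the "brands" and "location_name" columns and all values are assumptions.
SAMPLE = """safegraph_place_id,location_name,brands,top_category,naics_code
sg:0001,Acme Widgets,,Other Miscellaneous Manufacturing,339999
sg:0002,Example Cafe,,Restaurants and Other Eating Places,722515
sg:0003,Bolt Works,,Other Miscellaneous Manufacturing,3399
"""

def is_small_manufacturing(row):
    """Unbranded POI whose NAICS code falls under the 3399 group."""
    # Prefix match so that longer codes in the same group are also caught.
    return row["naics_code"].startswith("3399") and not row["brands"]

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
manufacturing = [r for r in rows if is_small_manufacturing(r)]
print([r["location_name"] for r in manufacturing])
```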

Category Fill Rate -- We monitor category fill rate with 3 metrics: (1) category fill rate across the entire dataset, (2) category fill rate for branded POIs, (3) category fill rate in the brand_info file (brand-level categories). We want all of these numbers to be 100%.

  • (1) All POI category fill rate. Last month 99.2%. This month 99.2%.

  • (2) Branded POI category fill rate. Last month 100%. This month 100% :100:

  • (3) Brand-level category fill rate (brand_info file). Last month 100%. This month 100% :100:

  • See the August Release Notes for details on our new and improved category ML model. 🎉
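
The three fill-rate metrics above can be reproduced on your own copy of the data with a few lines. The sketch below is illustrative only: it assumes "category fill" means a non-empty top_category, that a POI is branded when a brands column is non-empty, and that brand_info rows carry the same top_category field. The sample rows are invented.

```python
def fill_rate(rows, column):
    """Percentage of rows with a non-empty value in `column`."""
    if not rows:
        return 0.0
    filled = sum(1 for r in rows if r.get(column))
    return 100.0 * filled / len(rows)

# Hypothetical POI rows and brand_info rows (all values invented).
pois = [
    {"brands": "Example Brand", "top_category": "Grocery Stores"},
    {"brands": "", "top_category": "Restaurants and Other Eating Places"},
    {"brands": "", "top_category": ""},
    {"brands": "Example Brand", "top_category": "Grocery Stores"},
]
brand_info = [{"brand_name": "Example Brand", "top_category": "Grocery Stores"}]

all_poi_rate = fill_rate(pois, "top_category")             # metric (1)
branded_pois = [r for r in pois if r["brands"]]
branded_rate = fill_rate(branded_pois, "top_category")     # metric (2)
brand_level_rate = fill_rate(brand_info, "top_category")   # metric (3)
print(all_poi_rate, branded_rate, brand_level_rate)
```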

Drops ⬇️

  • We constantly ingest data from new sources, and many safegraph_place_ids (sgpids) are intentionally dropped, but we are unable to track each and every dropped sgpid. In this release:

    • We dropped 61,616 sgpids (23,793 branded and 37,823 non-branded).
    • 6k dropped due to POI source changes
    • 1,493 dropped as a result of bug fixes for branded POIs :bug:
    • 44k dropped as a result of deduplication - largely due to implementing the new model 👯‍♂️
    • 1,374 dropped due to permanent closures :x: (dropped but not lost -- you will still see these POIs if you get the open/close columns).
  • The remaining drops are undesired failures to maintain a consistent sgpid between releases - known as bad sgpid churn (see the discussion in the March 2019 release notes). We are continuing to work on better metrics to distinguish good vs. bad churn.
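
As a back-of-the-envelope check, the size of the remaining "bad churn" can be estimated by subtracting the explained drops from the total. The arithmetic below uses only the figures listed above; because two of them are reported rounded ("6k" and "44k"), the result is approximate, on the order of 8,700 sgpids.

```python
total_dropped = 61_616
branded, non_branded = 23_793, 37_823

explained = {
    "POI source changes": 6_000,      # reported as "6k", so approximate
    "branded bug fixes": 1_493,
    "deduplication": 44_000,          # reported as "44k", so approximate
    "permanent closures": 1_374,
}

# The branded / non-branded split should add up to the total.
assert branded + non_branded == total_dropped

remaining = total_dropped - sum(explained.values())
share = 100 * remaining / total_dropped
print(f"~{remaining:,} drops not explained above (~{share:.0f}% of all drops)")
```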

Enhancements - Geometry

  • While OWNED polygons are preferred, it does not mean that SHARED polygons are inherently bad. It only means that the exact shape of each POI within the polygon is not discernible, but the general location can be identified by the centroid (latitude & longitude). 🎯

  • When enclosed = FALSE, it indicates that there are reasonable means to derive a unique polygon for the POI (even when parent_safegraph_place_id is not null), and we strive for 100% of branded, non-enclosed POIs to have polygon_class = "OWNED_POLYGON".

  • Last month, the percentage of OWNED polygons for branded, non-enclosed POIs was 78.8%.

  • This month, the percentage of OWNED polygons for branded, non-enclosed POIs is 80% :chart-with-upwards-trend:
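
The OWNED-polygon percentage above can be approximated on your own data by restricting to branded, non-enclosed POIs and counting those with polygon_class = "OWNED_POLYGON". This is a sketch rather than SafeGraph's internal calculation: the brands column name and the "SHARED_POLYGON" value are assumptions, and the rows are invented.

```python
def pct_owned(rows):
    """Percent of branded, non-enclosed POIs whose polygon_class is OWNED_POLYGON."""
    eligible = [r for r in rows if r["brands"] and not r["enclosed"]]
    if not eligible:
        return None  # nothing to measure
    owned = sum(1 for r in eligible if r["polygon_class"] == "OWNED_POLYGON")
    return 100.0 * owned / len(eligible)

# Hypothetical rows; "SHARED_POLYGON" is an assumed value for shared polygons.
pois = [
    {"brands": "Example Brand", "enclosed": False, "polygon_class": "OWNED_POLYGON"},
    {"brands": "Example Brand", "enclosed": False, "polygon_class": "OWNED_POLYGON"},
    {"brands": "Example Brand", "enclosed": False, "polygon_class": "OWNED_POLYGON"},
    {"brands": "Example Brand", "enclosed": False, "polygon_class": "SHARED_POLYGON"},
    {"brands": "Example Brand", "enclosed": True, "polygon_class": "SHARED_POLYGON"},  # enclosed: excluded
    {"brands": "", "enclosed": False, "polygon_class": "OWNED_POLYGON"},               # unbranded: excluded
]

owned_pct = pct_owned(pois)
print(f"{owned_pct:.1f}% OWNED among branded, non-enclosed POIs")
```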

Bug Fixes and Known Issues - Geometry

  • Centroid-Radius Polygons -- As discussed in the March 2019 release notes, we internally track centroid-radius polygons vs. precise polygons and strive for 100% precise polygons. You can measure this yourself using the is_synthetic column.
    • This release, we saw a slight decrease to 95.8% precise polygons (95.9% last month)
    • Here is how we are tracking on this metric across releases: Centroid-Radius Polygon Tracking.
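
One way to measure the precise-polygon share yourself with the is_synthetic column is sketched below. It assumes that a true is_synthetic value marks a centroid-radius polygon and that the flag may arrive either as a boolean or as a CSV-style string such as "TRUE"; the sample rows are invented.

```python
def parse_bool(value):
    """Accept real booleans or CSV-style strings such as "TRUE" / "false"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"

def pct_precise(rows):
    """Percent of POIs whose polygon is not flagged as synthetic."""
    if not rows:
        return None
    precise = sum(1 for r in rows if not parse_bool(r["is_synthetic"]))
    return 100.0 * precise / len(rows)

# Hypothetical rows mixing string and boolean flags.
sample = [
    {"safegraph_place_id": "sg:0001", "is_synthetic": "FALSE"},
    {"safegraph_place_id": "sg:0002", "is_synthetic": "FALSE"},
    {"safegraph_place_id": "sg:0003", "is_synthetic": "TRUE"},
    {"safegraph_place_id": "sg:0004", "is_synthetic": False},
]

precise_pct = pct_precise(sample)
print(f"{precise_pct:.1f}% precise polygons")
```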

Enhancements - Patterns

  • In last month's delivery, SG Patterns had 4,078,861 points-of-interest (US only). This month, SG Patterns has 4,095,560 points-of-interest (US only) (net +16,699). :chart-with-upwards-trend:

  • Last month, SG Patterns had 868,811,661 visits from 35,455,162 visitors. This month, SG Patterns has 850,573,530 visits from 35,756,143 visitors (delta -18,238,131 visits, +300,981 visitors).


**In case you missed it,** check out [last month's release notes](https://docs.safegraph.com/changelog/september-2020-release-notes).

**Calculating Diffs**
Curious to find the specific records that were either **added, deleted, or saw an attribute change** from one release to the next? Visit "Calculating Diffs" in our [Data Science Resources](https://docs.safegraph.com/docs/data-science-resources#section-calculating-diffs) to get started. 
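
As a rough illustration of the idea (not necessarily the method described on the linked page), two releases can be compared by keying each on safegraph_place_id and classifying records as added, deleted, or changed. The rows and the location_name column below are invented for the example.

```python
def diff_releases(previous, current, key="safegraph_place_id"):
    """Return (added, deleted, changed) lists of IDs between two releases."""
    prev = {r[key]: r for r in previous}
    curr = {r[key]: r for r in current}
    added = sorted(curr.keys() - prev.keys())
    deleted = sorted(prev.keys() - curr.keys())
    # Present in both releases, but at least one attribute differs.
    changed = sorted(k for k in prev.keys() & curr.keys() if prev[k] != curr[k])
    return added, deleted, changed

# Hypothetical rows from two consecutive releases.
previous_release = [
    {"safegraph_place_id": "sg:0001", "location_name": "Example Cafe"},
    {"safegraph_place_id": "sg:0002", "location_name": "Example Books"},
    {"safegraph_place_id": "sg:0003", "location_name": "Example Gym"},
]
current_release = [
    {"safegraph_place_id": "sg:0001", "location_name": "Example Cafe"},
    {"safegraph_place_id": "sg:0002", "location_name": "Example Books & Coffee"},
    {"safegraph_place_id": "sg:0004", "location_name": "Example Bakery"},
]

added, deleted, changed = diff_releases(previous_release, current_release)
print("added:", added, "deleted:", deleted, "changed:", changed)
```

The same function should work with placekey as the key by passing key="placekey", provided both releases carry that column.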

**Fill Rates**
See the [Summary Statistics](https://docs.safegraph.com/docs/places-summary-statistics) page for all Core and Geometry column fill rates as well as a breakdown of POI count by `naics_code`.

**Also check out these new ways to get SafeGraph data:**
  * Need some extra data or other SafeGraph products? Check out the [SafeGraph Data Bar.](https://shop.safegraph.com/) 
  * Heavy AWS user? Check out our [listings in the AWS Data Exchange](https://aws.amazon.com/marketplace/search/results?filters=vendor_id&vendor_id=7d5ff8ca-105f-4856-9d99-5f2f1d83223c).
  * Are you an Esri or ArcGIS user? Check out our FREE data [SafeGraph Places in the Esri Marketplace](https://marketplace.arcgis.com/listing.html?id=3425348e4bee4059af2b353e52df43c2) and enjoy [SafeGraph Places in Esri Basemaps](https://www.esri.com/arcgis-blog/products/arcgis-living-atlas/mapping/new-places-in-esri-vector-basemaps/). 
  * Snowflake user? Check out our page on the [Snowflake Data Exchange](https://www.snowflake.com/datasets/safegraph/) :snowflake: 
  * Or just drop us a line! Your data needs are our data delights!