April-2019 Release Notes (v2019-03-29)
We know these improvements are amazing, and we'd like to assure you this is not an April Fool's joke.
Core Places and Brands
Enhancements - Core Places and Brands
- Last month SG Places had 4,774,401 places. This month SG Places has 4,779,045 places (net +4,644 places). 📈
- We've added 381 new brands 🎊 including:
- ampm (ampm.com, SG_BRAND_db57f3767efde48f) with 1103 places.
- Hardee's Red Burrito (hardees.com/redburrito, SG_BRAND_a9fd3c6e57be83ce) with 296 places.
- Rosati's Chicago Pizza (rosatispizza.com, SG_BRAND_944d5e24cfc93afa) with 193 places.
- Goodyear Commercial Tire & Service Center (goodyearctsc.com, SG_BRAND_7bc3d0016a136896) with 173 places.
- and 377 more!!
- Significantly improved entity resolution and de-duplication 🔀
- Impact: This improved technology led to the discovery and removal of ~160,000 duplicate POI from our dataset. We also discovered ~120,000 POI that were being incorrectly merged but are actually distinct places.
- Details: Figuring out whether two data records from two different sources refer to the same place is one of the core challenges we work on at SafeGraph. We’ve made multiple improvements to our de-duplication technology in this release. The largest accuracy gains came from better feature engineering, especially for comparison features between POI names. For example, our original model struggled with POI names that start the same way but end differently, e.g. “AT&T” and “AT&T Authorized Retailer”. Now our features recognize that these POI names are relatively similar, even though the latter has significantly more letters than the former.
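To make the name-comparison idea concrete, here is a minimal sketch. It is not SafeGraph's actual model or feature set. It contrasts a plain character-level similarity with a hypothetical prefix-aware token feature; the function names `char_similarity` and `leading_token_match` and the tokenization rule are assumptions made for illustration only.

```python
import re
from difflib import SequenceMatcher


def tokenize(name):
    """Lowercase a POI name and split it into word-like tokens (keeps & and ')."""
    return re.findall(r"[a-z0-9&']+", name.lower())


def char_similarity(a, b):
    """Character-level similarity in [0, 1]; drops quickly when lengths differ."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def leading_token_match(a, b):
    """Share of the shorter name's tokens that match the start of the longer name."""
    short, long_ = sorted((tokenize(a), tokenize(b)), key=len)
    if not short:
        return 0.0
    matched = 0
    for s, l in zip(short, long_):
        if s != l:
            break
        matched += 1
    return matched / len(short)


pair = ("AT&T", "AT&T Authorized Retailer")
print(round(char_similarity(*pair), 2))  # low score despite the shared brand prefix
print(leading_token_match(*pair))        # 1.0: every token of the shorter name matches
```

A feature like `leading_token_match` is insensitive to how many extra words trail the shared prefix, which is the property the paragraph above describes. A real matcher would presumably combine several such features with address and location signals.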
- Brand Names are now case-smart and canonical
As known to all fastidious SafeGraph customers (which is almost all SafeGraph customers), historically SafeGraph brand names and branded-location names have not always been formatted in a standardized manner. (We've all seen a location for starbucks (SG_BRAND_f116acfe9147494063e58da666d1d57e) right next to a location for EILEEN FISHER (SG_BRAND_8e66c99aa833dd0ced592ee5ba50e743) and wondered... why is one lower case and the other upper case?) Now all brand names are case-smart and canonical (i.e. the name and casing a consumer would expect). 131 brands changed their names; the full list of changes is documented here. The query below samples six of them:
SELECT
M.safegraph_brand_id,
M.brand_name as brand_name_March2019,
A.brand_name as brand_name_April2019
FROM brand_info_march2019 M
LEFT JOIN brand_info_april2019 A
ON M.safegraph_brand_id = A.safegraph_brand_id
WHERE M.brand_name <> A.brand_name
ORDER BY RAND()
LIMIT 6
Results:
safegraph_brand_id | brand_name_March2019 | brand_name_April2019 |
---|---|---|
SG_BRAND_64a77880c7f7c1d3133d10e574c97a8b | kohls | Kohl's |
SG_BRAND_962f9b1d1de0bf5b87f4782eafcfd5e5 | wendys | Wendy's |
SG_BRAND_b581ece69c7ca08c57e57d8aa919224d | L'OCCITANE | L'Occitane |
SG_BRAND_2c9fcf03e737a9c4f882534ef6a57b8c | BALLSTON SPA NATIONAL BANK (BSNB) | Ballston Spa National Bank (BSNB) |
SG_BRAND_24fdc423822298896dcd7ae0548f1498 | UNIQLO US | Uniqlo |
SG_BRAND_3d459942728f7a636ce726527858d8f8 | SAINT LAURENT PARIS | Saint Laurent |
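The sample above mixes pure formatting fixes (casing, apostrophes) with renames that drop or change words. As a rough illustration, assuming the old and new names are available as plain strings, the hypothetical helper `is_formatting_only` below separates the two cases by comparing names after stripping case and punctuation. The normalization rule is an assumption, not SafeGraph's definition of a canonical name.

```python
import re


def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def is_formatting_only(old, new):
    """True when old and new differ only in casing, spacing, or punctuation."""
    return normalize(old) == normalize(new)


# (March 2019 name, April 2019 name) pairs taken from the results table above
changes = [
    ("kohls", "Kohl's"),
    ("wendys", "Wendy's"),
    ("L'OCCITANE", "L'Occitane"),
    ("UNIQLO US", "Uniqlo"),
    ("SAINT LAURENT PARIS", "Saint Laurent"),
]
for old, new in changes:
    kind = "formatting only" if is_formatting_only(old, new) else "renamed"
    print(f"{old!r} -> {new!r}: {kind}")
```

Because names can change between releases while IDs stay the same, joining across releases on safegraph_brand_id, as the query above does, is safer than joining on brand_name.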
Bugs and Known issues - Core Places and Brands
- Bad SGPID Churn -- Bad sgpid churn is the undesired failure to maintain consistent safegraph_place_ids (sgpids) between releases (see the discussion in the March 2019 release notes). We internally track and estimate our performance in this domain and share these numbers in our release notes for maximum transparency. In the April-2019 release:
- We dropped 295,132 sgpids (127,945 branded and 167,187 non-branded).
- We added 299,776 sgpids (89,542 branded and 210,234 non-branded).
- Some percentage of these are true openings and closings; the remainder is bad sgpid churn. We are working on better metrics for distinguishing the two cases.
- NB: These numbers are much higher than last month due to (a) the improved de-duplication described above and (b) an internal overhaul and re-factoring of some of our most important sources of data for branded POI. Despite our best efforts, this refactor caused more instability than our average release.
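The dropped and added counts above are set differences between the sgpids of two consecutive releases. A minimal sketch, assuming each release's safegraph_place_ids are available as a Python set (the function name `sgpid_churn` and the toy IDs are hypothetical):

```python
def sgpid_churn(previous_ids, current_ids):
    """Return (dropped, added) sgpid sets between two releases."""
    dropped = previous_ids - current_ids  # present last release, missing now
    added = current_ids - previous_ids    # new in this release
    return dropped, added


# Toy IDs for illustration only
march = {"sg:a", "sg:b", "sg:c"}
april = {"sg:b", "sg:c", "sg:d", "sg:e"}
dropped, added = sgpid_churn(march, april)
print(sorted(dropped), sorted(added))

# Consistency check on the figures reported in these notes:
# added minus dropped equals the net change in total places.
assert 299_776 - 295_132 == 4_644
```

A set difference alone cannot tell a true closing from an ID that failed to persist, which is why the raw counts overstate bad churn.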
- Category Fill Rate -- We monitor category fill rate with 3 metrics: (1) category fill rate across the entire dataset, (2) category fill rate for branded POI, (3) category fill rate in the brand_info file (brand-level categories). We want all of these numbers to be 100%.
- (1) All POI category fill rate. Last month 91%. This month 91%.
- (2) Branded POI category fill rate. Last month 98%. This month 100% 💯
- (3) Brand-level category fill rate (brand_info file). Last month 84%. This month 99%. 📈
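For readers who want to reproduce a fill-rate check on their own copy of the data, here is a minimal sketch. It assumes records are dictionaries and that a missing category is stored as `None` or an empty string; the field names `category` and `brand_id` are hypothetical placeholders, not the actual column names.

```python
def fill_rate(records, field):
    """Percent of records whose `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return 100.0 * filled / len(records)


# Toy records for illustration only
pois = [
    {"brand_id": "SG_BRAND_x", "category": "Restaurants"},
    {"brand_id": None, "category": ""},
    {"brand_id": None, "category": "Gas Stations"},
    {"brand_id": "SG_BRAND_y", "category": "Banks"},
]
branded = [p for p in pois if p["brand_id"]]

print(fill_rate(pois, "category"))     # 75.0  (metric 1: all POI)
print(fill_rate(branded, "category"))  # 100.0 (metric 2: branded POI only)
```

Metric (3) would apply the same function to the rows of the brand_info file instead of to POI rows.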
Geometry
Enhancements - Geometries
- Improved and additional cartography and polygons. New or improved polygon geometries for over 10,000 POI, including many new- and used-auto dealerships. The goal is for auto dealerships to consistently include the outdoor lot areas, since this is more representative of the place of interest than just the indoor building (and this is the preferred polygon for visit attribution use cases). We are not finished with dealerships, but we made significant progress this month.
For example, Auto Ranch, 311 G Street North West, Ardmore, OK, 73401 (sg:c76ee3bfa5a8440b804250d1f0fe52c0), is a small used-car dealership. The SafeGraph Places polygon_wkt now includes the entire lot for this business instead of just the smaller building.
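One way to see the effect of such a change is to compare polygon areas before and after. The sketch below is illustrative only: it parses the outer ring of a simple `POLYGON` WKT string and applies the shoelace formula. The helper names and the two toy polygons are hypothetical, and the coordinates are treated as planar units. Real polygon_wkt values are presumably longitude/latitude, so they would need projecting before an area in square meters is meaningful.

```python
import re


def parse_polygon_wkt(wkt):
    """Parse the outer ring of a simple 'POLYGON ((x y, ...))' string into (x, y) tuples."""
    match = re.match(r"\s*POLYGON\s*\(\(([^()]*)\)", wkt, re.IGNORECASE)
    if not match:
        raise ValueError("expected a POLYGON WKT string")
    return [tuple(map(float, pt.split())) for pt in match.group(1).split(",")]


def ring_area(points):
    """Planar area via the shoelace formula (squared coordinate units)."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(points, points[1:] + points[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


# Toy shapes: a building footprint and a larger lot that contains it
building = "POLYGON ((0 0, 10 0, 10 5, 0 5, 0 0))"
lot = "POLYGON ((0 0, 30 0, 30 20, 0 20, 0 0))"

print(ring_area(parse_polygon_wkt(building)))  # 50.0
print(ring_area(parse_polygon_wkt(lot)))       # 600.0
```

For production work a geometry library would handle holes, multipolygons, and projections, which this sketch ignores.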
Bugs and Known issues - Geometries
- Centroid-Radius Polygons -- As discussed in the March 2019 release notes, we internally track centroid-radius polygons vs. precise polygons and strive for 100% precise polygons.
Here is how we have tracked on that metric over the last few releases.
SafeGraph Places Version | Percent Precise Polygons |
---|---|
April 2019 (v2019-03-29) | 92.8 |
March 2019 (v2019-02-28) | 92.8 |
February 2019 (v2019-01-30) | 92.7 |
January 2019 (v2018-12-20) | 90.9 |