FAQs
SafeGraph General
- Which records were added, deleted, or changed since last release?
- How do I work with SafeGraph data in Spark?
- What are you using for MSAs in the Shop?
Places
- How often is SafeGraph Places updated?
- How should I match SafeGraph Places with existing internal POI data?
- How do I use SafeGraph Places in ESRI?
- How does SafeGraph assign NAICS codes to points of interest?
Geometry
- How can I visualize the `polygon_wkt` from SafeGraph Geometry?
- BigQuery does not like my polygons?
- What coordinate reference system does SafeGraph use for its centroid and polygon coordinates?
Spend
- Is the Spend Transactions Panel the same as the Patterns Mobile Device Panel?
- Which kinds of cards are included in the Spend dataset? Is it only one type of card?
- How do I unzip .csv.gz files?
Patterns / Weekly Patterns / Neighborhood Patterns
Legacy Product
This page references SafeGraph Patterns, Weekly Patterns, and/or Neighborhood Patterns, legacy products that will no longer be available at the start of 2023. If you are interested in foot traffic data, please contact us and we can refer you to a mobility data partner.
- How often is Patterns data updated?
- What version of census block groups does SafeGraph use for the Patterns products?
- How does SafeGraph apply the census block group idea for Canada?
- How do I aggregate census block groups to zip codes?
- How do I work with the Patterns columns that contain JSON?
- Is there a SafeGraph SDK?
- Do you have historical Patterns data?
- In Weekly / Neighborhood Patterns, why do visits / stops show spikes at 8pm ET / 5pm PT?
- What is the breakdown of SafeGraph’s mobility device panel? Is there a skew toward certain demographics?
- What are the differences between SafeGraph's US and Canada mobility panel, if any?
- How does Patterns data deal with visits to POIs within a multi-floor building?
- Why are Patterns visits so low to this one particular place?
Which records were added, deleted, or changed since last release?
- See the "Calculating Diffs" section of Data Science Resources for plug and play code to answer all of your burning diff questions.
How do I work with SafeGraph data in Spark?
- If new to Spark, check out this quick intro to Spark.
- If using Scala Spark, make sure to use `.option("escape", "\"")` when reading in the data. So, you would read in the data like this:

```scala
val df = spark.read.option("header", "true")
  .option("escape", "\"")
  .csv("/PUT YOUR PATH HERE")
```

- If using Python/PySpark, read in the data as follows:

```python
df = spark.read.option("header", True).option("escape", "\"").csv("/PUT YOUR PATH HERE")
```
What are you using for MSAs in the Shop?
- You might have noticed that you can order data by Metropolitan Statistical Area in the SafeGraph Shop.
- The MSAs are defined here.
How often is SafeGraph Places updated?
- SafeGraph issues updates to Places once per month, which is much more frequently than other POI vendors, who may update once every 3-6 months.
- We can do this because we work with more sources of data and are much more efficient at combining those sources of data. During each month, some subset of our sources will send us their updates, and we ensure that we onboard and integrate those changes quickly and easily.
- This enables us to quickly reflect store openings and closings in our Places database.
- The time between a store opening / closing and being reflected in our Places database is approximately the time it takes for one of our sources to see the store update plus the time it takes SafeGraph to reflect that update in our data.
- The latter is typically within the month -- very fast compared to competitors, who might take up to 3 months.
- The former is hard to predict, but we work with sources that generally receive updates very quickly.
How should I match SafeGraph Places with existing internal POI data?
- Matching place data is very difficult. Some places will match immediately (i.e., store name, address, zipcode, etc. are exactly the same), but the majority of places will not. Is "peets coffee" at "345 5th street" in our database the same as "Peet's Coffee & Tea" at "357 fifth st." in another database? Basic exact matching will not match these two, so your team will need to build out advanced deduplication logic or you will notice significant discrepancies.
- SafeGraph offers a Matching Service that we recommend utilizing for this purpose. Please contact us if you're interested!
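If you do want to attempt matching yourself, a common first step is normalizing names and addresses before comparing them. Below is a minimal, illustrative sketch of that kind of normalization (this is not SafeGraph's matching logic):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace for fuzzier comparison."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)  # drop punctuation like ' and &
    return re.sub(r"\s+", " ", text).strip()

# Normalization alone is not enough: these two strings still differ, so real
# matching also needs token-level similarity, address standardization, and
# geographic blocking (e.g., only compare candidates within the same zipcode).
print(normalize("Peet's Coffee & Tea"))  # -> "peets coffee tea"
print(normalize("peets coffee"))         # -> "peets coffee"
```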
How do I use SafeGraph Places in ESRI?
First, a friendly reminder that Patterns does not have any geospatial data on its own. If you want to do geospatial analysis, you should augment these datasets with Geometry, which contains a latitude and longitude coordinate for every POI.
Visualizing POI as point data in Esri
Let's say your goal is to visualize a point for each POI on a map and have the Patterns data available in the pop-up in ArcGIS Online (AGOL).
- First, load the SafeGraph csv file into AGOL. Make sure your data includes lat/long (any data cut that includes PLACES or GEOMETRY will). Instead of "Locate by Address or Places" select "Coordinates" and make sure latitude and longitude are mapped correctly (it should auto-detect this). The data should then load successfully.
- Open the SafeGraph data in a map in AGOL.
Visualizing POI as polygons in Esri
There are a few methods to take the `polygon_wkt` data in SafeGraph Geometry and visualize it. Unfortunately, ArcGIS Online cannot natively read the `polygon_wkt`, so you will have to convert it.
This Google Colab notebook illustrates a best practice for converting SafeGraph Geometry files to Esri SHP files using geopandas; a minimal sketch of that approach is below.
Alternatively, if you are working with arcpy, you can convert a WKT to an ArcGIS Polygon geometry using the `FromWKT()` function. If none of these meet your workflow needs, we recommend contacting Esri support to develop a workflow solution. See also: Visualizing WKT.
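For reference, here is a minimal geopandas sketch of that conversion (the input/output file names are hypothetical; `polygon_wkt` is the SafeGraph Geometry column):

```python
import pandas as pd
import geopandas as gpd
from shapely import wkt

df = pd.read_csv("safegraph_geometry_data.csv")

# Parse the WKT strings into shapely geometries
df["geometry"] = df["polygon_wkt"].apply(wkt.loads)

# SafeGraph coordinates are WGS84 (EPSG:4326)
gdf = gpd.GeoDataFrame(df, geometry="geometry", crs="EPSG:4326")
gdf.to_file("safegraph_geometry.shp")  # Esri shapefile, readable in ArcGIS
```

Note that the shapefile format truncates column names to 10 characters, so longer SafeGraph column names will be shortened in the output.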
What coordinate reference system does SafeGraph use for its centroid and polygon coordinates?
WGS84, also referred to as EPSG:4326.
How does SafeGraph assign NAICS codes to points of interest?
- We strive to assign each point of interest the most reasonable and appropriate NAICS code, using a multi-pronged approach: human experts label NAICS codes for brands; we use the business name as an indication of its category; we crawl extra open-source information about a point of interest to infer the most correct NAICS code; and we use a deep neural network model to match long-tail POIs to NAICS codes based on the name and other data points we have crawled.
- Note that most data SafeGraph curates and reports has an objective truth, like `zip_code` or `visits_by_day`. In contrast, there is no objective truth for NAICS codes. NAICS codes are detailed descriptive categories created by governments, but they do not perfectly describe every business. There are many examples of a point of interest that reasonably fits into multiple NAICS codes or does not fit well into any. In these cases we strive for the "most correct" answer.
- If you see a NAICS code that doesn't make sense to you, let us know!
How can I visualize the `polygon_wkt` from SafeGraph Geometry?
If you are proficient with Esri tools, then you have some options in Esri.
If you are not familiar with any GIS tools and are just looking for some quick and easy visualizations, we recommend Kepler.gl. You can upload a SafeGraph CSV directly into Kepler and see points and polygons within seconds.
BigQuery does not like my polygons?
- We have found that running the `ST_GEOGFROMTEXT` function in Google's BigQuery on our full dataset returns an error: `ST_GeogFromText failed: Invalid polygon loop`. This is caused by only a handful of our polygons (under 20) not playing well with BigQuery. We have not encountered this issue with other geo libraries.
- So that this does not stop you from calling this function on the polygons generally, use `SAFE.ST_GEOGFROMTEXT(wkt)`. Your query will run, and the few problematic polygons will simply return `NULL`.
- We are looking into a solution so that this error does not occur at all.
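For example, if you query BigQuery from Python, the safe version looks like this (a minimal sketch; the table name is hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
sql = """
    SELECT placekey, SAFE.ST_GEOGFROMTEXT(polygon_wkt) AS geog
    FROM `your_project.your_dataset.safegraph_geometry`
"""
df = client.query(sql).to_dataframe()

# The handful of problematic polygons come back as NULL instead of erroring out
print(df["geog"].isna().sum(), "polygons failed to parse")
```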
What version of census block groups does SafeGraph use for the Patterns products?
- SafeGraph uses the 2010-2019 version of the census block groups for the U.S., specifically the 2016 vintage.
- You can find more information and a link to download the U.S. census block group geometries on our Open Census Data page!
How does SafeGraph apply the census block group idea for Canada?
- For Canadian entries in any CBG column (e.g., `poi_cbg` or `visitor_home_cbgs`), we use the Canadian Dissemination Area designations (Canadian units have `CA:` as a prefix).
How do I aggregate census block groups to zip codes?
- Check out some of the awesome data science resources we have on our Github page. If you search "zip" on that page, you'll find examples in Python and in R.
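The general pattern is a weighted crosswalk join. Here is a minimal pandas sketch (the crosswalk file and all column names are hypothetical; see the Github resources above for vetted crosswalks), which assumes one row per (cbg, zipcode) pair with an overlap weight, since block groups do not nest cleanly inside zip codes:

```python
import pandas as pd

# Hypothetical crosswalk and data files; keep geographic IDs as strings
crosswalk = pd.read_csv("cbg_to_zip_crosswalk.csv", dtype={"cbg": str, "zipcode": str})
cbg_data = pd.read_csv("my_cbg_level_data.csv", dtype={"cbg": str})

# Apportion each CBG's value across the zip codes it overlaps
merged = cbg_data.merge(crosswalk, on="cbg", how="inner")
merged["weighted_value"] = merged["value"] * merged["weight"]

zip_data = merged.groupby("zipcode", as_index=False)["weighted_value"].sum()
```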
Is the Spend Transactions Panel the same as the Patterns Mobile Device Panel?
- No, they come from completely different sources and therefore are unrelated.
- The transactions data do not come from mobile devices; this is what allows us to have robust Spend data for indoor locations, which are challenging for mobile GPS signals.
Which kinds of cards are included in the Spend dataset? Is it only one type of card?
- We're not able to say exactly which credit card brands are included, but the panel includes both debit cards (i.e., bank cards) and credit cards.
- The panel is also not all from one particular brand, e.g., not all Mastercard or Visa.
- Usually this question is asked with the purpose of understanding the representativeness of our panel. If this is a concern, please see our material on Quantifying Geographic Bias comparing our panel to the census (Average bias < 1% with a maximum of +/-4% per state).
How do I unzip .csv.gz files?
- `.gz` files are not regular zip files; they are gzipped.
- If you use a Mac, you can use the `gunzip` utility in Terminal to unzip them. See more here.
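If you prefer to stay in Python (works on any OS), here is a minimal sketch (the file name is hypothetical):

```python
import gzip
import shutil

# Decompress a .csv.gz file to a plain .csv
with gzip.open("patterns_data.csv.gz", "rb") as f_in:
    with open("patterns_data.csv", "wb") as f_out:
        shutil.copyfileobj(f_in, f_out)

# pandas can also read gzipped CSVs directly, with no manual decompression:
# pd.read_csv("patterns_data.csv.gz", compression="gzip")
```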
How often is Patterns data updated?
- Monthly Patterns and Neighborhood Patterns data for the previous month is available on the ~7th of the next month (e.g., data for the month of September is typically available October 7th or earlier).
- Weekly Patterns data for the previous week is available on the Wednesday of the next week (e.g., data for the week spanning Mon Sept 1 to Sun Sept 7 is typically available on Wed Sept 10 or earlier).
How do I work with the Patterns columns that contain JSON?
- We have a simple web app for exploding the JSON here. You can explode it horizontally (into more columns) or vertically (into more rows). Just upload your file and pick which columns you want exploded. This is a quick and easy solution if you have a file with 1k or fewer rows (about 1MB) and do not want to explode beyond 20k rows.
- If you ❤️ Excel, we have an add-in that you can install to parse the JSON columns. The add-in can be downloaded here. See video demo of installation and usage. Written instructions are here. ⚠️This is only recommended for small samples of the data (100 rows or so)!
- We also have a SQL example if you use Snowflake. See Tutorial here.
- Want more control?
- To horizontally explode the JSON into more columns programmatically, see an example using pandas here.
- To vertically explode the JSON into more rows programmatically, here are some code examples using PySpark, Scala Spark, pandas, R, and SQL (shown in order below):
"""
This code takes SG Patterns data as a PySpark DataFrame
and vertically explodes
the `visitor_home_cbgs` column into many rows.
The resulting dataset has 3 columns:
safegraph_place_id, visitor_count, visitor_home_cbg.
"""
from pyspark.sql.functions import udf, explode
from pyspark.sql.types import *
import json
def parser(element):
return json.loads(element, MapType(StringType(), IntegerType()))
jsonudf = udf(parser, MapType(StringType(), IntegerType()))
visitor_home_cbgs_parsed = df.withColumn("parsed_visitor_home_cbgs", jsonudf("visitor_home_cbgs"))
visitor_home_cbgs_exploded = visitor_home_cbgs_parsed.select("safegraph_place_id", explode("parsed_visitor_home_cbgs"))
display(visitor_home_cbgs_exploded.selectExpr("safegraph_place_id as safegraph_place_id", "key as visitor_home_cbg","value as visitor_count"))
```scala
// This code takes SG Patterns data as a Scala Spark DataFrame and vertically
// explodes a JSON map column (e.g., `related_same_day_brand` or
// `visitor_home_cbgs`) into many rows.
import org.apache.spark.sql.functions._
import play.api.libs.json._

// Parse a JSON string into a Map[String, Int]
def parser(element: String) = {
  Json.parse(element).as[Map[String, Int]]
}
val jsonudf = udf(parser _)

val converted = df.withColumn("parsed_related_same_day_brand", jsonudf($"related_same_day_brand"))
display(converted.select($"safegraph_place_id", explode($"parsed_related_same_day_brand" as "exploded_related_same_day_brand")))

val visitor_home_cbgs_parsed = df.withColumn("parsed_visitor_home_cbgs", jsonudf($"visitor_home_cbgs"))
display(visitor_home_cbgs_parsed.select($"safegraph_place_id", explode($"parsed_visitor_home_cbgs" as "exploded_visitor_home_cbgs")))
```
"""
If you are working with large datasets (i.e. > 20,000 POI at a time),
then you should consider the Python-Pyspark solution;
it is much much more efficient).
This code takes SG Patterns data as a pandas DataFrame
and vertically explodes
the `visitor_home_cbgs` column into many rows.
The resulting dataset has 3 columns:
safegraph_place_id, visitor_count, visitor_home_cbg.
"""
import pandas as pd
import json
patterns_df = pd.read_csv("safegraph_patterns_data.csv")
# convert jsons to dicts
patterns_df = patterns_df.dropna(subset = ['visitor_home_cbgs'])
patterns_df['visitor_home_cbgs_dict'] = [json.loads(cbg_json) for cbg_json in patterns_df.visitor_home_cbgs]
# extract each key:value inside each visitor_home_cbg dict (2 nested loops)
all_sgpid_cbg_data = [] # each cbg data point will be one element in this list
for index, row in patterns_df.iterrows():
this_sgpid_cbg_data = [ {'safegraph_place_id' : row['safegraph_place_id'], 'visitor_home_cbgs' : key, 'visitor_count' : value} for key,value in row['visitor_home_cbgs_dict'].items() ]
# concat the lists
all_sgpid_cbg_data = all_sgpid_cbg_data + this_sgpid_cbg_data
home_cbg_data_df = pd.DataFrame(all_sgpid_cbg_data)
# note: home_cbg_data_df has 3 columns: safegraph_place_id, visitor_count, visitor_home_cbg
# sort the result:
home_cbg_data_df = home_cbg_data_df.sort_values(by=['safegraph_place_id', 'visitor_count'], ascending = False)
```r
# This code takes SG Patterns data as a data.frame (or, even better, a
# data.table) and vertically explodes the `visitor_home_cbgs` column
# (or the `visits_by_day` column) into many rows.
# The result has one column for safegraph_place_id, one for visitor_count,
# and one for visitor_home_cbgs / day.

# if you don't have the SafeGraphR package:
# install.packages('remotes')
# remotes::install_github('SafeGraphInc/SafeGraphR')
library(SafeGraphR)

# Generally, data.table::fread is preferred to read.csv,
# but this is fine for small files
patterns_df <- read.csv('safegraph_patterns_data.csv')

# expand_cat_json expands categorical JSON variables like visitor_home_cbgs
home_cbg_data_df <- expand_cat_json(patterns_df,
                                    expand = 'visitor_home_cbgs',
                                    index = 'origin_cbg',
                                    by = 'safegraph_place_id')
# Fix variable names
names(home_cbg_data_df)[names(home_cbg_data_df) == 'visitor_home_cbgs'] <- 'visitor_count'
names(home_cbg_data_df)[names(home_cbg_data_df) == 'origin_cbg'] <- 'visitor_home_cbgs'

# expand_int_json expands integer JSON variables like visits_by_day
day_data_df <- expand_int_json(patterns_df,
                               expand = 'visits_by_day',
                               index = 'day',
                               by = 'safegraph_place_id')
# Fix variable names
names(day_data_df)[names(day_data_df) == 'visits_by_day'] <- 'visitor_count'
```
```sql
/*
This code explodes the popularity_by_hour column into rows.
The result will be one row per hour per POI (i.e., 24 x NUM_POIs rows).
The table name here will work for the Starbucks Free Sample data in
Snowflake. Replace with your own table name as desired.
Note also that we take the INDEX value from the expl table below because the
popularity_by_hour column has no keys. For other JSON columns with keys,
replace expl.INDEX with expl.KEY.
*/
WITH exploded AS (
    SELECT
        patterns.placekey
        ,patterns.location_name
        ,patterns.city
        ,expl.INDEX as hour
        ,expl.VALUE as visits
    FROM
        (SELECT * FROM STARBUCKS_PATTERNS_SAMPLE.PUBLIC.PATTERNS) as patterns,
        TABLE(FLATTEN(input => PARSE_JSON(patterns.popularity_by_hour))) as expl
)
SELECT *
FROM exploded
```
Is there a SafeGraph SDK?
No. SafeGraph does not have an SDK, nor any software that is embedded in mobile apps.
Do you have historical Patterns data?
- Yes! We have Patterns data going back to January 1st, 2018 for the US and back to January 1st, 2019 for Canada. Beyond that, please contact us.
- In order to successfully compare the data over time, we encourage normalizing based on our panel size over time. Each monthly and weekly delivery of Patterns includes normalization statistics to enable this. Please see our Data Science Resources for guidance on how to go about doing this (a minimal normalization sketch also appears after the backfill list below).
- Please note that the underlying Places (i.e., Places + Geometry) data used to create Patterns changes over time. Ongoing releases will always be using the latest version of Places: for example, all Patterns data from Jan 2021 onward will encompass the new POIs added to Places in Jan 2021 (i.e., industrial POIs). This also means that historical Patterns data will not contain new POIs until historical Patterns are re-generated with new versions of Places, which is generally done no more than twice a year. See the Backfill Key Concept in Patterns for more details.
- Below is a reverse-chronological breakdown of the Places releases used to backfill Patterns for past releases. This is provided for transparency only - once a new backfill is released, we advise always using the latest re-generated version of historical Patterns (except in very rare situations), as it will be most consistent with ongoing releases (in terms of schema, POI, bug fixes, etc.):
- Patterns provided/delivered July 2021 onward:
  - Activity from January 2018 through and including July 2021 was generated using the July 2021 release of Places.
  - Activity from August 2021 onward is based on the Places release of the same month as the activity (so August 2021 activity uses the August 2021 Places release).
- Patterns provided/delivered between December 2020 and June 2021:
  - Activity from January 2018 through and including December 2020 was generated using the Dec 2020 release of Places. This was the first historical delivery that considers point-in-time POI openings/closures. For example, if a POI opened in January 2019, we will not attribute visits to the POI from January 2018 - December 2018 and will only attribute visits from January 2019 onward. On the other hand, if a POI closed in January 2019, we will only attribute visits from January 2018 - December 2018 and will not attribute visits from January 2019 to present. We rely on the metadata provided by our `closed_on`, `opened_on`, `tracking_closed_since`, and `tracking_opened_since` columns to make these determinations. If we do not have open/close information for a POI, we treat the POI as "open" for the duration of the backfill. See here for more about how we determine POI openings/closings.
  - Activity from January 2021 to June 2021 was based on the Places release of the same month as the activity.
- Patterns provided/delivered between May 2020 and November 2020:
  - Activity from January 2018 through and including May 2020 was generated using the May 2020 release of Places.
  - Activity from May 2020 through and including November 2020 was based on the Places release of the same month as the activity (so June 2020 activity uses the June 2020 Places release).
- Patterns provided/delivered between November 2019 and April 2020:
  - Activity from January 2017 through and including October 2019 was generated using the November 2019 release of Places.
  - Activity from November 2019 through and including April 2020 was based on the Places release of the same month as the activity (so December 2019 activity uses the December 2019 Places release).
- Historical Patterns activity from October 2016 through and including December 2016 was generated using the April 2019 release of Places. We no longer externally provide this data.
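As referenced above, here is a minimal sketch of panel-size normalization (the file and column names are hypothetical; see the Data Science Resources for vetted approaches):

```python
import pandas as pd

# Hypothetical inputs: monthly visit totals and monthly panel sizes
visits = pd.read_csv("monthly_visits.csv")     # columns: month, raw_visit_counts
panel = pd.read_csv("monthly_panel_size.csv")  # columns: month, total_devices_seen

df = visits.merge(panel, on="month")

# Scale each month's visits by the inverse of panel size (relative to a
# baseline month) so that months with different panel sizes are comparable
baseline = df["total_devices_seen"].iloc[0]
df["normalized_visits"] = df["raw_visit_counts"] * (baseline / df["total_devices_seen"])
```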
In Weekly / Neighborhood Patterns, why do visits / stops show spikes at 8pm ET / 5pm PT?
Prior to generating visits from GPS pings, SafeGraph groups the raw GPS inputs by the day (in UTC time) on which they occurred. This has the effect of "splitting" visits that cross the UTC day boundary (7/8pm EST/EDT and 4/5pm PST/PDT). Specifically, the following columns are impacted:
- `raw_visit_counts` and `visits_by_day`: Visits that begin prior to and continue after the UTC day boundary will be split into two visits. POIs that conduct most of their business in the evening hours in the US will show inflated visit counts compared to those that receive most of their foot traffic in the morning and afternoon.
- `median_dwell` and `bucketed_dwell_times`: Because visits that traverse the UTC day boundary are split, dwell times will be shorter for POIs that primarily do business in the evening.
- `visits_by_each_hour` / `stops_by_each_hour` (Weekly Patterns / Neighborhood Patterns): Visits will show spikes in the hour immediately after the UTC day boundary (e.g., 8-9pm EDT and 5-6pm PDT), as visits that began prior to the day boundary will be registered as separate, standalone visits after the day boundary. Note that we provide `popularity_by_each_hour` in Neighborhood Patterns as a complementary column that does not have this behavior.
- `popularity_by_hour` (Monthly Patterns): Conversely, popularity will be slightly depressed in the hour immediately following the UTC day boundary, because the GPS pings received in that hour alone may not be enough to register as a visit.
- Note that columns related to visitor counts are unaffected, as these are tied to the unique device ID that originated the pings.
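As a quick sanity check on those times, here is a minimal Python sketch showing that the UTC day boundary falls at 8pm ET / 5pm PT when daylight saving is in effect:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Midnight UTC -- the day boundary used to group raw GPS pings
utc_midnight = datetime(2022, 7, 2, 0, 0, tzinfo=ZoneInfo("UTC"))

print(utc_midnight.astimezone(ZoneInfo("America/New_York")))     # 2022-07-01 20:00 (8pm EDT)
print(utc_midnight.astimezone(ZoneInfo("America/Los_Angeles")))  # 2022-07-01 17:00 (5pm PDT)

# A visit from 7:30pm to 8:30pm EDT crosses this boundary,
# so it would be split into two visits in the Patterns data.
```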
What is the breakdown of SafeGraph’s mobility device panel? Is there a skew toward certain demographics?
- SafeGraph does not have any data on the demographics of devices in the panel because the data is anonymized when we receive it.
- That said, in our own publicly available analysis, we found that extrapolated demographic information based on devices within each census block group match well with overall U.S. demographics, which gives us confidence that the panel is well-sampled across demographic categories.
What are the differences between SafeGraph's US and Canada mobility panel, if any?
- Summary statistics on the size of both panels are provided on our Patterns Summary Statistics page. As a proportion of overall population, the Canadian panel is slightly smaller.
- The Canadian panel has fluctuated more in terms of total size than the U.S. See statistics on total devices seen from the July 2021 backfill along with guidelines on normalization.
How does Patterns data deal with visits to POIs within a multi-floor building?
- Visit attribution in multi-story buildings is challenging because of the poor vertical accuracy of GPS inside buildings, and it is not currently handled explicitly in the creation of our data.
- However, it is handled implicitly by differentiating between categories of POIs at different times of day. That is, our visit attribution algorithm uses NAICS code x hour dummy-variable features to assign visits among POIs that are, in this case, stacked on top of each other.
- In practice, this means that if floor two is a bakery and floor one is a bar, and the time of the visit is 10pm, our model would probably attribute the visit to the bar.
Why are Patterns visits so low to this one particular place?
- Our mobility data come from a panel that is only a sample of the whole U.S. population, meaning the data provide a proxy for real traffic, but there can be sampling bias for individual points of interest.
- This means that visits to an individual POI may simply not be well sampled in our panel.
- Because our visits attribution algorithm uses proximity as a feature, this can also sometimes occur if we don't have the geometry of the place quite right - i.e., a nearby POI can "steal" visits which should have been attributed elsewhere.
- If you think this is the case - let us know! We always do our best to improve our data wherever possible.