Places Manual

This document provides details on attribute methodology and answers frequently asked questions (FAQs) about the nuances of the SafeGraph Places dataset.

Core Places

safegraph_place_id

The SafeGraph Place ID safegraph_place_id is a persistent unique identifier for a place of interest across releases and the primary key in Places datasets.

Please note that a SafeGraph Place ID may change across releases for a small subset of places of interest (POIs). Let us know if you happen to notice an inconsistency and we'll address it ASAP!

safegraph_brand_id

  • SafeGraph curates over 4,300 distinct brands (and growing). These are chains of commercial POIs that include all major brands in the United States (McDonald's, AMC, Macy's, Chevrolet, Whole Foods Market, etc.).
  • ~1 million POIs are associated with at least one brand. Please note that ~4 million POIs have no associated brand because they are single commercial locations (local restaurants, museums, etc.). Please see Places Summary Statistics for more detail. SafeGraph is continually improving the fill rate of brands with each release -- please contact us if you notice a brand missing.
  • Some POIs include multiple brands. Car dealerships are a good example of this: a given dealership may sell multiple car brands. Another example is POIs that are co-located, such as some Taco Bell & KFC stores, or IMAX and AMC (or Regal, etc.) cinemas. In these cases, brands and safegraph_brand_ids are each listed as an array alphabetized by brand name; the order does not indicate importance.
  • Brands provide an easy way to isolate for only major stores. If you know you are searching for a brand that we cover, we advise searching the brand column instead of the name column. Even better is to search the brand_info file and build your workflows around safegraph_brand_id.
  • Every place has a name but only POIs belonging to a chain will have a brand. In certain cases, name and brand will be the same but in other cases, these fields may be different.
  • For example, if you’re searching for all McDonald’s (fast-food) stores, you would search for all POI entries where brands = 'mcdonalds'. Many stores that are not part of the fast-food chain may be called mcdonalds, so searching for name = 'mcdonalds' would incorrectly return non-fast-food stores. Please note that you may see alternative names in the name column even once you filter for brands = 'mcdonalds', but these will all be McDonald's fast-food stores.
  • Please note that if you are having difficulty matching location or brand names to your own POI listings, we offer a matching service that will provide you with the SafeGraph Place IDs (SGPIDs) of locations mapped to your existing POI data.

Name | Brand | Comment
mcdonald's us | mcdonalds |
mcdonald's store | mcdonalds | You may occasionally get variations in the name column, but as long as you query the brand column correctly, you’ll get all the McDonald’s correctly.
mcdonald’s tractor supply store | null | This POI, which is not a branch of the McDonald’s fast-food chain, is easy to notice because it has a null value in the brands column.
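
As a rough illustration of the advice above, the sketch below filters a few hypothetical rows by brand instead of by name. The row layout (dicts with a location_name string and a brands list) and the helper name has_brand are assumptions made for this example, not the delivered file format.

```python
# Hypothetical rows; in a real delivery these would come from the Core file.
rows = [
    {"location_name": "mcdonald's us", "brands": ["mcdonalds"]},
    {"location_name": "mcdonald's store", "brands": ["mcdonalds"]},
    {"location_name": "mcdonald's tractor supply store", "brands": None},
    {"location_name": "kfc taco bell", "brands": ["kfc", "taco bell"]},
]

def has_brand(row, brand):
    """True if the POI lists `brand` among its (possibly multiple) brands."""
    return brand in (row["brands"] or [])

# Filtering on the brand keeps only the fast-food chain.
by_brand = [r["location_name"] for r in rows if has_brand(r, "mcdonalds")]

# Filtering on the name also picks up the unrelated tractor supply store.
by_name = [r["location_name"] for r in rows if "mcdonald" in r["location_name"]]
```

Here by_brand holds the two chain locations, while by_name also contains the unbranded store.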

street_address

  • We implement a number of steps to clean, validate and standardize street_address.
  • You should expect street_address to be Title Cased, consistent and friendly for human reading. Please send us your feedback if you see otherwise.
  • If you care about street addresses as much as we do, we also have more specific address columns to split out address components. These are optional and available upon request for future deliveries!
    • primary_number
    • street_predirection
    • street_name
    • street_postdirection
    • street_suffix

naics_code, top_category, sub_category

  • SafeGraph Places uses the NAICS categorization taxonomy developed by the US Census Bureau that consists of a numeric NAICS code up to 6 digits in length.
  • The code itself is hierarchical; in other words, the first 2 digits describe a very general category, and additional digits describe more and more specific categories. For example:
    • 72 is the general category Accommodation and Food Services.
    • 722 is the more specific category Food Services and Drinking Places.
    • 7225 is the even more specific category Restaurants and Other Eating Places.
    • 722513 is the most specific category Limited-Service Restaurants (i.e. quick-serve or fast-food restaurants).
  • Category information is available for almost all our POI (see latest fill rate stats here). However, there are some businesses where we are not sure about the NAICS category; in these cases, NAICS will be left blank.
  • top_category and sub_category are the string labels associated with the first 4 digits and 6 digits of naics_code, respectively.
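
Because the code is hierarchical, a prefix check is usually enough to select a whole category. The following is a minimal sketch; the function names naics_levels and in_category are illustrative and not part of any SafeGraph tooling.

```python
def naics_levels(naics_code):
    """Return the 2-, 3-, 4- and 6-digit prefixes available in a NAICS code."""
    code = str(naics_code)
    return {n: code[:n] for n in (2, 3, 4, 6) if len(code) >= n}

def in_category(naics_code, prefix):
    """True if the code falls under the broader category given by `prefix`.

    A blank (None) naics_code never matches.
    """
    return naics_code is not None and str(naics_code).startswith(str(prefix))

levels = naics_levels(722513)
is_restaurant = in_category(722513, 7225)  # Restaurants and Other Eating Places
```

For 722513, levels maps 2 to "72", 3 to "722", 4 to "7225", and 6 to "722513".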

latitude & longitude

  • In general, latitude and longitude are defined by our best knowledge of the POI location. They are not designed to specifically locate the front door of the business, but rather to define the general center of the business.
  • Latitude and longitude still attempt to identify the individual business even if that business and others have the same polygon (e.g. strip mall).

open_hours

The new format for open hours is a JSON string with days as keys and opening & closing times (in the POI's local time) as values:

  • Each JSON string is guaranteed to have all 7 days as keys
  • We indicate that a POI is closed for the day by giving it a value of "[]"
  • We indicate that a POI is open the entire day by using a format like:

"Thu": [["0:00", "24:00"]]

  • For POI that open and close multiple times throughout the day (e.g. a restaurant open in the morning and evening but not midday), we list multiple opening/closing pairs. For example:

"Sat": [["8:00", "13:00"], ["15:00", "22:30"]]

  • This indicates that a POI is open from 8 am to 1 p.m. and also from 3 p.m. to 10:30 p.m. on Saturday.

  • For POI that open and close on different days (e.g. a bar which opens on Tuesday at 6 p.m. and closes on Wednesday at 2 a.m.), we use a format like:

"Tue": [["18:00", "24:00"]], "Wed": [["0:00", "2:00"]]

To reiterate: a “closing time” of 24:00 doesn’t mean the POI actually closes at midnight if it’s followed by an opening time of 0:00 on the following day.

Example Open Hours JSON string

{ "Mon": [["8:00", "22:00"]], "Tue": [["8:00", "13:00"], ["18:00", "24:00"]], "Wed": [["0:00", "2:00"]], "Thu": [["0:00", "24:00"]], "Fri": [["23:00", "24:00"]], "Sat": [["0:00", "3:00"], ["15:00", "22:30"]], "Sun": [] }

This example represents the following open / close times:

  • Open from 8 a.m. to 10 p.m. on Monday
  • Open from 8 a.m. to 1 p.m. and 6 p.m. onwards on Tuesday
  • Open until 2 a.m. on Wednesday (note: open from Tuesday 6 p.m. through 2 a.m. Wednesday)
  • Open all day on Thursday (i.e. midnight Wednesday to midnight Thursday)
  • Open from 11 p.m. onwards on Friday
  • Open until 3 a.m. and between 3 p.m. and 10:30 p.m. on Saturday
  • Closed on Sunday
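
The rules above can be sketched in a few lines of Python. This is one possible way to read the string, not an official parser; the helper names to_minutes, is_open, and closes_overnight are made up for the example.

```python
import json

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

def to_minutes(hhmm):
    """Convert "H:MM" (including "24:00") to minutes after midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)

def is_open(open_hours, day, hhmm):
    """True if local time `hhmm` on `day` falls inside any open/close pair."""
    t = to_minutes(hhmm)
    return any(to_minutes(start) <= t < to_minutes(end)
               for start, end in open_hours[day])

def closes_overnight(open_hours, day):
    """True if a 24:00 close on `day` is followed by a 0:00 open the next day."""
    next_day = DAYS[(DAYS.index(day) + 1) % 7]
    ends_at_midnight = any(end == "24:00" for _, end in open_hours[day])
    starts_at_midnight = any(start == "0:00" for start, _ in open_hours[next_day])
    return ends_at_midnight and starts_at_midnight

# The example string from this section.
raw = ('{ "Mon": [["8:00", "22:00"]], "Tue": [["8:00", "13:00"], ["18:00", "24:00"]], '
       '"Wed": [["0:00", "2:00"]], "Thu": [["0:00", "24:00"]], "Fri": [["23:00", "24:00"]], '
       '"Sat": [["0:00", "3:00"], ["15:00", "22:30"]], "Sun": [] }')
hours = json.loads(raw)

open_tuesday_afternoon = is_open(hours, "Tue", "14:00")   # inside the midday gap
tuesday_runs_past_midnight = closes_overnight(hours, "Tue")
```

With the example string, the POI is closed at 2 p.m. on Tuesday, and the Tuesday evening opening continues into Wednesday.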

phone_number

This is a 10-digit phone number. We filter out toll-free numbers (e.g. 1-800) and strive to have POI-specific numbers (not franchise-level or corporate-level numbers).

Geometry

polygon_wkt

  • Spatial Reference used: EPSG:4326
  • WKT stands for Well-Known Text. It’s a simple way to define a polygon/shape and is the standard format for polygons in SafeGraph Places.
  • Other geospatial file formats you may utilize include Shapefile and GeoJSON. WKT can easily be converted to these formats and file conversions are available by request.
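
To give a sense of how small the conversion is, here is a minimal sketch that turns a simple WKT polygon into a GeoJSON-style dict using only the standard library. It assumes a plain 2D POLYGON string (no MULTIPOLYGON, no Z values), the function name polygon_wkt_to_geojson is hypothetical, and the sample coordinates are invented. A dedicated geospatial library is likely a more robust choice for real data.

```python
import re

def polygon_wkt_to_geojson(wkt):
    """Convert a simple 'POLYGON ((x y, ...), (...))' string to a GeoJSON dict.

    Assumption: plain 2D POLYGON only; other geometry types raise ValueError.
    """
    match = re.fullmatch(r"\s*POLYGON\s*\(\s*(.*)\s*\)\s*", wkt,
                         flags=re.IGNORECASE | re.DOTALL)
    if match is None:
        raise ValueError("not a simple POLYGON WKT string")
    rings = []
    # Each parenthesized group is one ring: the outer boundary first, then holes.
    for ring_text in re.findall(r"\(([^()]*)\)", match.group(1)):
        ring = [[float(v) for v in point.split()] for point in ring_text.split(",")]
        rings.append(ring)
    return {"type": "Polygon", "coordinates": rings}

# Invented sample; WKT in EPSG:4326 lists longitude first, then latitude.
wkt = "POLYGON ((-122.4 37.8, -122.4 37.7, -122.3 37.7, -122.3 37.8, -122.4 37.8))"
geojson = polygon_wkt_to_geojson(wkt)
```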

parent_safegraph_place_id

  • For specific categories of places that tend to have a large number of “tenant” or “substores”, we explicitly try to identify a store as a sub-store and tie it to its parent (containing) store. Those categories include indoor malls, airports, college campuses, and stadiums.
  • In these cases, the tenant store will have a parent_safegraph_place_id that refers to the safegraph_place_id of the parent place of interest.
  • If a POI is not a tenant store, then the corresponding parent_safegraph_place_id will be null.
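
One way to use this column is to group tenants under their parent. The sketch below assumes rows are available as dicts keyed by column name; the IDs shown are invented placeholders.

```python
from collections import defaultdict

# Hypothetical rows: one mall and two tenants, plus a standalone store.
pois = [
    {"safegraph_place_id": "sg:mall", "parent_safegraph_place_id": None},
    {"safegraph_place_id": "sg:shoe-store", "parent_safegraph_place_id": "sg:mall"},
    {"safegraph_place_id": "sg:food-court", "parent_safegraph_place_id": "sg:mall"},
    {"safegraph_place_id": "sg:standalone", "parent_safegraph_place_id": None},
]

def tenants_by_parent(pois):
    """Map each parent safegraph_place_id to the IDs of its tenant POIs."""
    groups = defaultdict(list)
    for poi in pois:
        parent = poi["parent_safegraph_place_id"]
        if parent is not None:  # null means the POI is not a tenant
            groups[parent].append(poi["safegraph_place_id"])
    return dict(groups)

tenants = tenants_by_parent(pois)
```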

polygon_class

  • We include the column polygon_class to help identify how this POI fits in the spatial hierarchy of other POIs.
  • In dense environments such as indoor malls or multi-story buildings, we might not be confident about a POI’s true shape. In such cases, we will provide the overall structure polygon instead. This may result in several such POIs having the same polygon. In these cases, we will note that this POI is inside another POI in polygon_class.
  • In cases like this, if you need to be able to differentiate different stores within a shared polygon, you can use POI centroids. Since user GPS signals often drift inside of large structures, for use cases such as determining places visited by a user, we have found that user distance to centroid is a good substitute for distance to polygon.
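
The centroid approach mentioned above can be sketched as a nearest-centroid lookup using the haversine formula. This is only an illustration of distance to centroid, not SafeGraph's visit attribution model, and the coordinates and IDs are invented.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two latitude/longitude points."""
    r = 6371000.0  # mean Earth radius in meters
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_poi(ping, pois):
    """Return the POI whose centroid (latitude, longitude) is closest to the ping."""
    return min(pois, key=lambda p: haversine_m(ping[0], ping[1],
                                               p["latitude"], p["longitude"]))

# Hypothetical tenants that share one mall polygon.
pois = [
    {"safegraph_place_id": "sg:a", "latitude": 37.7800, "longitude": -122.4000},
    {"safegraph_place_id": "sg:b", "latitude": 37.7805, "longitude": -122.4010},
]
closest = nearest_poi((37.7804, -122.4009), pois)
```

In this example the ping is only a few meters from the second centroid, so closest is the "sg:b" POI.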

includes_parking_lot

  • In some cases, our polygons intentionally include the parking lot (e.g., car dealerships and gas stations). The includes_parking_lot column makes explicit to our customers whether the polygon_wkt includes the parking lot. There are three possible values: true, false, and null (null when we are not sure whether a parking lot is included in the geometry).

Patterns

raw_visit_counts

These are the aggregated raw counts of visits to the POI that we see from our panel of mobile devices.

  • We do not include any POI with fewer than 5 visits in total.

These values should be taken in the context of specific nuances & biases in our dataset:

Geographic Bias

  • Small geographic bias exists in our panel based on our understanding of the home locations of the devices in the panel.
  • SafeGraph tested for geographic bias by comparing its determination of the state-by-state numbers of home location of the devices in the panel to the true proportions reported by the 2016 US Census.
  • Based on that analysis, SafeGraph panel density closely mirrors true population density: the overall average difference is less than 1 percentage point, and the maximum difference for any state is +/-3 percentage points.
  • For a deep dive on geographic bias in the panel, see Quantifying Sampling Bias in SafeGraph Patterns.

Panel Growth

  • The panel has grown significantly since its inception. As such, it is important to normalize the data when doing time series analysis across long periods of time or multiple releases.
  • We have seen success by normalizing visits by the total number of visits in the SafeGraph Panel, month by month. It is also worth exploring normalizing based on state or census block group. With each delivery, we provide you with the Panel Overview Data files to enable you to do these calculations.
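
A minimal sketch of the month-by-month normalization described above follows. All figures are invented; in practice the panel totals would come from the Panel Overview Data files delivered with your data.

```python
# Hypothetical monthly figures: visits to one POI and total visits in the panel.
poi_visits = {"2019-01": 120, "2019-02": 150, "2019-03": 180}
panel_visits = {"2019-01": 1_000_000, "2019-02": 1_250_000, "2019-03": 1_500_000}

def normalized_visits(poi, panel):
    """Share of all panel visits that went to the POI, month by month."""
    return {month: poi[month] / panel[month] for month in poi}

shares = normalized_visits(poi_visits, panel_visits)
```

In this made-up case raw visits rise by 50% over three months, but the share of panel visits stays flat, which points to panel growth as the cause and not a change in foot traffic.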

Predicting Financial Indicators

SafeGraph data can be used to estimate foot traffic and predict financial indicators of companies (number of visitors, revenue, etc.). Please see Normalization White Paper: How to Use SafeGraph Visits Data to Predict Company Reported KPIs.

Correlation between reported company KPIs and SafeGraph visits will vary depending on multiple factors related to the company:

  • Does the business separately report online vs in-store sales and revenue?
  • How much do online sales contribute to the overall revenue?
  • How much revenue is generated outside of the USA (SafeGraph Visits are US only)?
  • What is the ground truth correlation between foot traffic and sales for that business? For example, the relationship between foot traffic and sales at a car dealership has a very different pattern than at a convenience store.

Visits to Dense Urban Areas

  • Visits to urban, suburban, and rural areas have varying precision levels. It is more difficult to measure visits to a midtown Manhattan Starbucks than a visit to a suburban standalone Starbucks. On average, rural visits have a 97% precision level, while suburban visits have 83% precision and urban areas 71% precision.

Visits to Large Structures/Indoor Malls

  • We attribute visits to the containing element i.e. the indoor mall or airport, and not to any individual tenants. We believe this is the most accurate option given the limitations of GPS inside such structures (e.g., indoor malls, casinos, hotels).

Visits to Strip Malls

  • We attribute visits to the individual stores as well as the parent strip mall (assuming we have a POI for the entire strip mall). There will be instances where we have not divided a strip mall polygon into its constituent stores. Our model to determine visits does take a number of factors into account, including distance from centroid, so even though there are multiple POI in one strip mall polygon, we attempt to allocate visits within the strip mall to the POI most likely to have received the visit.

Worker & Non-Worker Visits

  • To the extent we have identified a device as belonging to a worker at a POI, we exclude this device from our visit counts (on the theory that most customers are interested in shoppers/visitors).
  • We determine that a device belongs to a worker by looking at 1 month of data and whether the device was at the POI during traditional work hours and not during the weekend.

GPS Data

  • The visits are determined using GPS data.
  • We do not include any GPS data with a horizontal accuracy greater than 160 meters.

raw_visitor_counts

  • These are the aggregated raw counts of visitors.
  • We do not include a POI unless at least 5 visitors each made at least 1 visit to the POI in the month.
  • We do not include visitors which we have determined are workers at the POI.

visits_by_day

  • This is an array of visits on each day in the month.
  • We are breaking up days based on local time.
  • Because our one-month snapshot uses UTC while days are represented in local time, the last day of the month is cut off. For instance, California is 8 hours behind UTC during PST and 7 hours behind UTC during PDT. This means that during Daylight Saving Time, the last day in the array is missing the last 7 hours of the day in local time (between 5 p.m. and midnight).
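
To line the array up with calendar dates, something like the sketch below may help. It assumes the column arrives as a JSON array string and that you already know the first local day of the month; both are assumptions for this example, as is the function name visits_by_date.

```python
import json
from datetime import date, timedelta

def visits_by_date(visits_by_day, first_day):
    """Pair each entry of the visits_by_day array with its local calendar date."""
    counts = json.loads(visits_by_day)
    return {first_day + timedelta(days=i): n for i, n in enumerate(counts)}

# Invented four-day excerpt starting on 1 June 2019.
daily = visits_by_date("[3, 0, 5, 2]", date(2019, 6, 1))
```

Keep in mind that the final entry of a real array undercounts the last local day, for the reason given above.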

visitor_home_cbgs

  • These are the home census block groups of the visitors to the POI.
  • For each census block group, we show the number of associated visitors (as opposed to the number of visits).
  • We do not have a home census block group for every visitor, and not every visitor originates from the U.S. The number of U.S. visitors listed in the visitor_country_of_origin column represents the total number of visitors that we have determined originate from the U.S.
  • We do not include a census block group unless there are at least 5 visitors from that census block group.
  • We determine the home census block group by analyzing 6 weeks of data during nighttime hours (between 6 pm and 7 am). We require a sufficient amount of evidence (total data points and distinct days) to assign a home (common nighttime) location for the device.
  • The census block group is the highest geographic resolution for which the US Census provides demographic information. This demographic data is publicly available through APIs maintained by the US Census. SafeGraph provides census block group demographic data to download for free. There are also resources for developers on GitHub and Stack Overflow for working with the US Census APIs. Some of the most common APIs are the Population Estimates API and the Decennial Census API.
  • See also: How do I work with Patterns columns that contain JSON
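
As a small sketch of working with this column, the code below parses a value and lists the census block groups with the most visitors. It assumes the column is a JSON object mapping census block group IDs to visitor counts; the IDs and counts are invented, and top_home_cbgs is a hypothetical helper.

```python
import json

def top_home_cbgs(visitor_home_cbgs, n=3):
    """Return the n census block groups with the most visitors, largest first."""
    counts = json.loads(visitor_home_cbgs or "{}")
    # Sort by descending count, breaking ties by CBG ID for a stable order.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:n]

# Invented value; CBG IDs are kept as strings so leading zeros survive.
raw = '{"060750101001": 12, "060750102002": 7, "060014001001": 5}'
top = top_home_cbgs(raw, n=2)
listed_visitors = sum(json.loads(raw).values())
```

Because block groups with fewer than 5 visitors are omitted and some visitors have no assigned home, listed_visitors will generally be lower than raw_visitor_counts.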

visitor_work_cbgs

  • These are the work census block groups of the visitors to the POI.
  • For each census block group, we show the number of associated visitors (as opposed to the number of visits).
  • We determine the work census block group of a device by looking at 1 month of data and determining where the device is most frequently located during traditional work hours, excluding weekends and overnight periods. It is easier to determine the home (common nighttime) location of a device than its work census block group, so our data contains more home_cbgs than work_cbgs.
  • See also: How do I work with Patterns columns that contain JSON

visitor_country_of_origin

distance_from_home

  • This is the median distance from home to the POI in meters for the visitors for whom we have identified a home location.
  • We do not adjust for visits -- each visitor is counted equally.
  • In this calculation, we include census block groups that are not in the home_cbgs field because they had fewer than 5 visitors. This means that we could show no home_cbgs and still have a distance_from_home.

median_dwell

  • This is the median of the minimum dwell times we have calculated for each of the visits to the POI.
  • We determine the minimum dwell time of a visit by looking at the first and last ping we see from a device during that visit. This is a minimum dwell because it is possible the device was at the POI longer than the time of the last ping.
  • It is possible to have a minimum dwell of 0 if we only saw 1 ping and determined the visit based on factors such as Wi-Fi.
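
The calculation can be illustrated with a toy example. Ping times here are invented and expressed in minutes since midnight; the actual pipeline and units may differ.

```python
from statistics import median

def minimum_dwell(ping_times):
    """Minutes between the first and last ping of one visit (0 for a single ping)."""
    return max(ping_times) - min(ping_times)

# Hypothetical visits, each a list of ping times in minutes since midnight.
visits = [[600, 612, 645], [700], [900, 930]]
dwells = [minimum_dwell(v) for v in visits]
median_dwell = median(dwells)
```

The three visits have minimum dwells of 45, 0, and 30 minutes, so the median is 30.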

bucketed_dwell_times

related_same_day_brand

These are the brands that the visitors to this POI visit (on the same day that they visit the POI) in higher numbers than the general members of our panel. The number mapped to each brand is an indicator of how highly correlated a POI is to a certain brand beyond what we are seeing generally in the panel. For example, if a lot of visitors to Starbucks at 123 Main Street also tend to visit an unpopular brand on the same day, the number could be quite high (e.g., > 50) whereas if the same number of 123 Main Street Starbucks visitors also visit Targets on the same day, the number will be lower because Target is a popular brand.

See also: How do I work with Patterns columns that contain JSON

If you want to know the nitty-gritty of how we calculate this index, read on at your own risk:

  • For each day in the month, we find the total number of visitors who went to both the POI and another branded location. For each brand, we divide this number by the total number of visitors to the POI. This gives us our "POI Specific Brand Ratios" for each brand for each day in the month.
  • For each day in the month, we find the number of visitors who went to each brand divided by the total number of visitors in the Panel. This gives us our "Baseline Brand Ratios" for each brand for each day in the month.
  • For each brand, we take the POI Specific Brand Ratio for each day of the month and subtract from it the corresponding Baseline Brand Ratio (the "Daily Percentage"). We then take the median of the differences. If the result is greater than or equal to 5%, we include the brand in the list.
  • In determining the median we exclude any POI Specific Brand Ratios that are 0.
  • Note that the final number is rounded so it is possible to have 100 (likely because the applicable Baseline Brand Ratio is less than 0.5%).

For example, if on the first of the month, 20 visitors out of 100 that went to a certain SoulCycle POI also went to a Sephora while in the Panel generally, only 2 out of 100 visitors went to a Sephora, the Daily Percentage would be 18% (20/100 - 2/100). This 18% would be included with the other Daily Percentages for the month to determine the median of those numbers.
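
The steps above can be sketched as follows. This is a reading of the description, not SafeGraph's actual implementation; the input layout (one dict per day with a visitor total and per-brand visitor counts) and all numbers are invented for the example.

```python
from statistics import median

def same_day_brand_index(poi_days, panel_days, threshold=5.0):
    """Median daily difference, in percentage points, between POI and panel brand ratios.

    poi_days / panel_days: one dict per day with "visitors" and per-brand counts.
    """
    brands = {b for day in poi_days for b in day["brands"]}
    result = {}
    for brand in sorted(brands):
        daily = []
        for poi, panel in zip(poi_days, panel_days):
            poi_ratio = poi["brands"].get(brand, 0) / poi["visitors"]
            if poi_ratio == 0:
                continue  # days with a POI ratio of 0 are excluded from the median
            baseline = panel["brands"].get(brand, 0) / panel["visitors"]
            daily.append(100 * (poi_ratio - baseline))  # the "Daily Percentage"
        if daily and median(daily) >= threshold:
            result[brand] = round(median(daily))
    return result

# Three invented days for one POI and for the panel as a whole.
poi_days = [
    {"visitors": 100, "brands": {"sephora": 20, "target": 30}},
    {"visitors": 50, "brands": {"sephora": 5, "target": 10}},
    {"visitors": 80, "brands": {"target": 16}},
]
panel_days = [
    {"visitors": 100, "brands": {"sephora": 2, "target": 28}},
    {"visitors": 200, "brands": {"sephora": 4, "target": 40}},
    {"visitors": 100, "brands": {"sephora": 3, "target": 25}},
]
index = same_day_brand_index(poi_days, panel_days)
```

Sephora has Daily Percentages of 18 and 8 on the two days it was visited, giving a median of 13, which clears the 5% threshold. Target's median difference is 0, so it is left out.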

related_same_month_brand

These are the brands that the visitors to this POI visit in higher numbers than the general members of our panel over the course of the month. The number mapped to each brand is an indicator of how highly correlated a POI is to a certain brand beyond what we are seeing generally in the panel. For example, if visitors to Starbucks at 123 Main Street also tend to visit an unpopular brand a lot, the number could be quite high (e.g., > 50) whereas if the same number of 123 Main Street Starbucks visitors also visit Targets a lot, the number will be lower because Target is a popular brand.

See also: How do I work with Patterns columns that contain JSON

If you want to know the nitty-gritty of how we calculate this index, read on at your own risk:

  • For the entire month, we find the total number of visitors who went to both the POI and another branded location. For each brand, we divide this number by the total number of visitors to the POI. This gives us our "POI Specific Brand Ratios" for each brand.
  • For the entire month, we find the number of visitors who went to each brand divided by the total number of visitors in the Panel. This gives us our "Baseline Brand Ratios" for each brand.
  • For each brand, we take the POI Specific Brand Ratio and subtract from it the corresponding Baseline Brand Ratio. If the result is greater than or equal to 5%, we include the brand in the list.
  • Note that the final number is rounded so it is possible to have 100 (likely because the applicable Baseline Brand Ratio is less than 0.5%).

For example, if for the entire month 20 visitors out of 100 that went to a certain SoulCycle POI also went to a Sephora while in the Panel generally, only 2 out of 100 visitors went to a Sephora, the percentage would be 18% (20/100 - 2/100).

popularity_by_hour

  • This is an array of visits seen in each hour of the day over the course of the month.
  • Local time is used.
  • If a visitor stays for multiple hours, an item in the array will be incremented for each hour during which the visitor stayed. This means that if you sum the numbers in the popularity_by_hour array the sum will likely be greater than the amount shown in the raw_visit_counts column (since the raw_visit_counts counts a multiple hour visit as one visit).
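
A toy example shows why the hourly sum can exceed raw_visit_counts. Visits are represented here as invented (start_hour, end_hour) pairs within a single day, which is a simplification of the real data.

```python
def popularity_by_hour(visits):
    """Count, for each hour of the day, the visits that overlapped that hour.

    visits: (start_hour, end_hour) pairs in local time, end hour inclusive.
    """
    counts = [0] * 24
    for start_hour, end_hour in visits:
        for hour in range(start_hour, end_hour + 1):
            counts[hour] += 1
    return counts

# One visit spans the 9:00 through 11:00 hours; the other two stay within one hour.
visits = [(9, 11), (10, 10), (17, 17)]
hourly = popularity_by_hour(visits)
raw_visit_counts = len(visits)
```

The array sums to 5 even though there are only 3 visits, because the multi-hour visit is counted once in each of its three hours.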

popularity_by_day

device_type

Column Orderings

  • Files are delivered in the places_joined delivery format. The exact columns of your delivery depend on which of the products you purchased. If column order matters to you, take heed. The full schema for places_joined is as follows:

    • The order of columns for Core + Geometry + Patterns is safegraph_place_id,parent_safegraph_place_id,safegraph_brand_ids,location_name,brands,top_category,sub_category,naics_code,latitude,longitude,street_address,city,region,postal_code,open_hours,polygon_wkt,polygon_class,phone_number,is_synthetic,includes_parking_lot,iso_country_code,date_range_start,date_range_end,raw_visit_counts,raw_visitor_counts,visits_by_day,visitor_home_cbgs,visitor_work_cbgs,visitor_country_of_origin,distance_from_home,median_dwell,bucketed_dwell_times,related_same_day_brand,related_same_month_brand,popularity_by_hour,popularity_by_day,device_type
    • The order of columns for combined Core + Geometry is safegraph_place_id,parent_safegraph_place_id,safegraph_brand_ids,location_name,brands,top_category,sub_category,naics_code,latitude,longitude,street_address,city,region,postal_code,open_hours,polygon_wkt,polygon_class,phone_number,is_synthetic,includes_parking_lot,iso_country_code
    • The order of columns for Patterns (only) is safegraph_place_id,location_name,street_address,city,region,postal_code,brands,date_range_start,date_range_end,raw_visit_counts,raw_visitor_counts,visits_by_day,visitor_home_cbgs,visitor_work_cbgs,visitor_country_of_origin,distance_from_home,median_dwell,bucketed_dwell_times,related_same_day_brand,related_same_month_brand,popularity_by_hour,popularity_by_day,device_type,iso_country_code
    • The order of columns for Core (only) is safegraph_place_id,parent_safegraph_place_id,safegraph_brand_ids,location_name,brands,top_category,sub_category,naics_code,latitude,longitude,street_address,city,region,postal_code.
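
If you would prefer not to depend on column position, one option is to read columns by name. The sketch below assumes the delivered CSV includes a header row, which you should confirm against your own files; the sample content is invented.

```python
import csv
import io

# Hypothetical three-column excerpt; real files are gzipped CSVs with many more columns.
sample = io.StringIO(
    "safegraph_place_id,location_name,brands\n"
    "sg:123,mcdonald's store,mcdonalds\n"
)

# DictReader keys each row by header name, so column order no longer matters.
rows = list(csv.DictReader(sample))
place_ids = [row["safegraph_place_id"] for row in rows]
```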

Delivery Cadence and Directory Structure

  • If you are an enterprise customer, SafeGraph products are delivered (together) on approximately the 7th of each month.
    • Up to 4 files will be delivered with the following structure: s3://customer-bucket/customer-prefix/{{sg-file-name}}/yyyy/mm/dd/hh/*.csv.gz. {{sg-file-name}} is one of the following:
      • core_poi, geometry, or patterns, or a combination such as core_poi-geometry, core_poi-patterns, or core_poi-geometry-patterns (depending on your subscription). This file includes all of the following to which you are subscribed: Core, Geometry, Patterns.
      • brand_info (if subscribed to Core)
      • home_panel_summary (if subscribed to Patterns)
      • visits_panel_summary (if subscribed to Patterns)
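
For scripting against the bucket, the directory structure above can be matched with a regular expression. This is a sketch under the assumption that keys follow the pattern exactly as written; the sample key, including the part file name, is invented.

```python
import re

# Mirrors customer-prefix/{{sg-file-name}}/yyyy/mm/dd/hh/*.csv.gz
KEY_PATTERN = re.compile(
    r"^(?P<prefix>.+)/(?P<file_name>[^/]+)"
    r"/(?P<yyyy>\d{4})/(?P<mm>\d{2})/(?P<dd>\d{2})/(?P<hh>\d{2})"
    r"/(?P<part>[^/]+\.csv\.gz)$"
)

def parse_delivery_key(key):
    """Split an object key into prefix, file name, timestamp parts, and part file."""
    match = KEY_PATTERN.match(key)
    return match.groupdict() if match else None

parsed = parse_delivery_key(
    "customer-prefix/core_poi-patterns/2019/07/07/16/part-0001.csv.gz"
)
```

The file_name group tells you which product combination a file belongs to, for example core_poi-patterns or brand_info.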
