Base Attributes
The Base Attributes of our Places data provide the fundamental details about a POI. This includes location name, address, lat/long, category, brand, and more. See below for additional detail on each column.
Contents
- Column Definitions for definitions and in-depth details on each column
- Key Concepts relevant to our Places data
Column Definitions: Base Attributes
placekey
Placekey is a unique and persistent identifier for any physical place in the US that intelligently partitions the ID into meaningful encodings. See the Placekey key concept for a detailed description.
parent_placekey
This Placekey column identifies a larger place that may encompass a given POI, which we refer to as the "Parent". Think of an indoor shopping mall as the parent of the individual stores inside. For any place without an assigned polygon, the parent_placekey column will be null because we rely on geometric relationships to identify parent/child hierarchy. For example, our Point POIs will not have an assigned parent because they do not have defined polygons. You can find out more about our process for defining these relationships in our Spatial Hierarchy section, where we also include a list of all the types of places that can serve as "Parents".
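As a rough illustration of working with the parent/child relationship, the sketch below groups child POIs under their parent. The rows and Placekey values are made-up placeholders (not valid Placekeys), and the function name children_by_parent is hypothetical. It assumes a null parent_placekey arrives as None or an empty string.

```python
from collections import defaultdict

# Made-up rows; the placekey values are placeholders, not real Placekeys.
rows = [
    {"placekey": "PLACEKEY_MALL", "parent_placekey": None},
    {"placekey": "PLACEKEY_STORE_A", "parent_placekey": "PLACEKEY_MALL"},
    {"placekey": "PLACEKEY_STORE_B", "parent_placekey": "PLACEKEY_MALL"},
    {"placekey": "PLACEKEY_BUS_STOP", "parent_placekey": None},  # e.g. a Point POI
]


def children_by_parent(rows):
    """Map each parent Placekey to the list of Placekeys of its children."""
    groups = defaultdict(list)
    for row in rows:
        parent = row.get("parent_placekey")
        if parent:  # skip POIs with no assigned parent (null)
            groups[parent].append(row["placekey"])
    return dict(groups)


mall_children = children_by_parent(rows)
```

POIs with a null parent_placekey are simply absent from the values of the result, so they never show up as children.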
location_name
The best name that can be given to the POI. This will most likely align with the business name shown on the front door (as opposed to the legal entity name). For less obvious locations (like bus stops), location_name will display the most descriptive string possible, such as the name of the operator concatenated with the way the stop is identified.
safegraph_brand_ids, brands
These columns reflect the "brand" or "brands" that we associate with a given POI (and the corresponding IDs we have assigned them). See our Brands section for additional details on what we consider a brand and how we maintain them.
top_category, sub_category, naics_code
top_category and sub_category are the string labels associated with the first 4 digits and the full 6 digits of naics_code, respectively. See the Categorization of POI section below.
latitude, longitude
- In general, latitude and longitude are defined by our best knowledge of the POI location. They are not designed to locate the front door of the business specifically, but rather define the general center of the business.
- Latitude and longitude still attempt to identify the individual business even if that business shares a polygon with others (e.g., in a strip mall).
street_address
- We implement a number of steps to clean, validate, and standardize street_address.
- You should expect street_address to be title-cased, consistent, and friendly for human reading. Please send us your feedback if you see otherwise.
- If you care about street addresses as much as we do, we also have more specific address columns that split out address components. These are optional and available upon request for future deliveries.
  - primary_number
  - street_predirection
  - street_name
  - street_postdirection
  - street_suffix
city
- In the US, city names are the output of normalized address strings from POI sources. There is no widely adopted standard for US cities, so we try our best to provide an alternative city name for each US POI in alt_address.city in our address attributes schema. We use the US Census Bureau's designated places boundaries to encode an alternative city name when it differs from the POI source. We do this by spatially joining all US POI centroids (latitudes/longitudes) against this boundary file.
- In Canada, city names are the output of normalized address strings from POI sources.
- In Great Britain, city names are the output of normalized address strings from POI sources, but in edge cases, we allow POIs to have a null city name as long as region is populated. The region column in Great Britain refers to county boundaries, and counties are a decent alternative to cities for geographic filtering.
- city may be null for POIs outside of the US and Canada as well as for National Park POIs in the U.S.
region, iso_country_code
- Starting in February 2023, region in Places will align with OpenStreetMap’s (OSM) understanding of global administrative boundaries. OSM utilizes a numerical hierarchy to describe geographical entities. These are denoted by the tag admin_level and a corresponding value (read more here).
- This system helps ensure a consistent understanding of a geographical unit across countries and is a widely accepted standard in the spatial data community. region in Places should match closely with how regions are understood for each country (e.g., prefectures in Japan).
- We recommend comparing region values to OSM admin_levels 3-6 and believe that admin_level = 4 is generally the best fit for most countries.
- For all countries, iso_country_code will be equivalent to admin_level = 2.
postal_code
- When iso_country_code == US, this is the 5-digit US ZIP code.
- When iso_country_code == CA, this is the Canadian postal code in the form of a 3-character Forward Sortation Area (FSA), a space, and a 3-character Local Delivery Unit (LDU).
- When iso_country_code == GB, this is the British postal code. Learn more about Great Britain postal code precision here.
- postal_code may be null for National Park POIs in the U.S.
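The US and Canadian formats described above can be sanity-checked with a couple of regular expressions. This is a minimal sketch, not an official validator: the function name postal_code_looks_valid is hypothetical, the Canadian pattern assumes the usual letter-digit-letter, space, digit-letter-digit layout in upper case, and codes for other countries (including GB) are passed through unchecked.

```python
import re

# 5-digit US ZIP code, e.g. "94107".
US_ZIP = re.compile(r"[0-9]{5}")
# Canadian FSA, a space, then LDU, e.g. "K1A 0B1" (assumed upper case).
CA_POSTAL = re.compile(r"[A-Z][0-9][A-Z] [0-9][A-Z][0-9]")


def postal_code_looks_valid(postal_code, iso_country_code):
    """Return True if postal_code matches the documented shape for its country."""
    if postal_code is None:
        # Null is allowed, e.g. for National Park POIs in the U.S.
        return True
    if iso_country_code == "US":
        return US_ZIP.fullmatch(postal_code) is not None
    if iso_country_code == "CA":
        return CA_POSTAL.fullmatch(postal_code) is not None
    # Other countries are not checked in this sketch.
    return True
```

Keeping postal codes as strings matters here, since a numeric type would drop the leading zero from ZIP codes such as "02139".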
census_code
census_code shows the ID of a statistical area created and maintained by a country's census bureau for population and demographic reporting. In most countries, the census bureau collects data across many levels of granularity.
- Ex: In the U.S., the census bureau reports data at the country, state, county, census tract, census block group, and census block level.
- We always encode the ID of the most granular unit where a country's census bureau collects and reports common population/demographic statistics.
census_code can then be leveraged as a join key into open census datasets (like SafeGraph's Open Census Data in the US) to enrich Places with key demographic insights.
When iso_country_code == US, the census_code is a FIPS (Federal Information Processing Standards) code, a hierarchical ID that denotes the following areas from coarsest to finest: state, county, census tract, and census block group.
- Example FIPS code: 012345678901
  - 01 = state
  - 01234 = county
  - 01234567890 = census tract
  - 012345678901 = census block group
- That's four keys in one ID! 🎉 Census Block Groups (CBGs) are the second smallest geographical unit of analysis maintained by the US Census Bureau, and the smallest unit of analysis used for demographic reporting. A typical CBG contains ~2000-7500 residents. SafeGraph currently uses the 2020-2029 US Census Bureau data to derive census_code in the U.S.
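Because each level of the US code is a prefix of the next, the four keys can be recovered by slicing the string. The sketch below follows the prefix lengths in the example above (2, 5, 11, and 12 characters). The function name split_cbg_fips is hypothetical, and it assumes census_code is held as a string so that leading zeros survive.

```python
def split_cbg_fips(census_code):
    """Split a 12-digit US census block group code into its nested prefixes."""
    code = str(census_code)
    if len(code) != 12 or not code.isdigit():
        raise ValueError(f"expected a 12-digit code, got {census_code!r}")
    return {
        "state": code[:2],
        "county": code[:5],
        "census_tract": code[:11],
        "census_block_group": code,
    }


keys = split_cbg_fips("012345678901")
```

Any of the returned prefixes can then serve as a join key against census tables published at that level.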
When iso_country_code == CA, the census_code is a dissemination area.
- Dissemination areas are the smallest unit of analysis used for demographic reporting in Canada.
census_code is currently null where iso_country_code <> US and iso_country_code <> CA.
Key Concepts
Placekey
Placekey is an identifier for any physical place in the world that partitions the ID into meaningful encodings. Placekey allows information about places to be easily shared across organizations and data sets by simplifying the merging of data on physical places. This enables deduplication, normalization, and place entity recognition.
Each Placekey is divided into two parts: What and Where, written as “What@Where”. This is a unique way of shedding light on both the descriptive element of a place as well as its geospatial position in the physical world.
What: Address and POI Encoding
The “What” part of a Placekey is optional and encodes the Address and the POI (if there is a POI). An address at “555 Main Street Suite 105” will have a different What Encoding than “555 Main Street Suite 106.” However, "444 Second Street, Suite 4" will have the same address encoding as "444 2nd St. #4" to adjust for common address formats.
If a specific place has a location name (like "Central Park") and is already included in the Placekey reference datasets, these characters will be present. The benefit of the POI Encoding is that it can point to a specific point of interest that may have existed at a certain address at a given point in time.
Where: H3 Encoding
The “Where” part of a Placekey is built upon Uber’s open source H3 grid system. This information in the “Where” part is based on the centroid of that place. In other words, we take the latitude and longitude of a specific place and then use a conversion function to determine a hexagon in the physical world, representing about 15,000 sq. meters, containing the centroid of that place. The “Where” of the Placekey is, therefore, the full encoding of that hexagon. Each “Where” is specified by 9 characters. The string does not explicitly code exact spatial distances, but the code does become more specific when reading left to right.
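To make the What@Where layout concrete, here is a small sketch that splits a Placekey string into its two parts and checks the 9-character length of the Where part. The function names are hypothetical, and the sample value is only an illustration of the shape. Treating hyphens as separators that do not count toward the 9 characters is an assumption of this sketch.

```python
def split_placekey(placekey):
    """Return (what, where); what is None when the optional What part is absent."""
    what, _, where = placekey.rpartition("@")
    return (what or None), where


def where_has_nine_characters(where):
    """Check that the Where part has 9 characters, ignoring hyphen separators."""
    return len(where.replace("-", "")) == 9


# Illustrative value only; not guaranteed to be a real Placekey.
what, where = split_placekey("223-227@5vg-82n-kzz")
```

Two POIs that share a Where part fall in the same hexagon, so grouping on that substring gives a coarse spatial grouping without any geometry library.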
Open access to your own datasets using the Placekey API.
Brands
SafeGraph curates thousands of distinct brands with more added every month. These are chains of commercial POIs that represent major brands around the world (McDonald's, AMC, Macy's, Chevrolet, Whole Foods Market, etc.). But they can also reflect regional brands that may only have a handful of locations, as long as they are operating under a common logo or store banner.
Note that ~80% of POIs have no brand associated as they are single commercial locations (local restaurants, museums, etc.). SafeGraph is continually improving the fill rate of brands with each release - please contact us if you notice a missing brand.
Brands provide an easy way to isolate major stores. If you know you are searching for a brand that we cover, we advise searching by the brands column instead of the location_name column. For even better specificity, search the brand_info file by brands and build your workflows around safegraph_brand_id.
Every place has a location_name, but only POIs belonging to a chain will have a brand. In some cases, location_name and brand will be the same, but in other cases they are intentionally different. For example, the most common name for an individual Starbucks store is its brand name, so it is also reflected in the location_name column. However, the most common name for the Bellagio Hotel & Casino is not its brand name "MGM Resorts." In this case, location_name shows "Bellagio Hotel & Casino" and brands shows "MGM Resorts."
Car dealer brands (naics_code = 441110): A car dealership may sell multiple car brands, and in these cases, brands and safegraph_brand_ids are listed as arrays alphabetized by brand name (the order does not specify any importance). This is the only category that currently has a 1:many place<>brand relationship.
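The difference between searching by brands and by location_name can be sketched as follows. The rows are illustrative ("Example Auto Group" and "Corner Cafe" are made up), the function name pois_for_brand is hypothetical, and the sketch assumes brands has already been parsed into a list of strings per POI, which may not match how the column is serialized in a given delivery.

```python
# Illustrative rows; brands is assumed to be a parsed list of strings.
rows = [
    {"location_name": "Starbucks", "brands": ["Starbucks"]},
    {"location_name": "Bellagio Hotel & Casino", "brands": ["MGM Resorts"]},
    {"location_name": "Example Auto Group", "brands": ["Chevrolet", "Ford"]},
    {"location_name": "Corner Cafe", "brands": []},  # unbranded local business
]


def pois_for_brand(rows, brand):
    """Return the location_name of every POI associated with the given brand."""
    return [row["location_name"] for row in rows if brand in row["brands"]]


mgm_pois = pois_for_brand(rows, "MGM Resorts")
chevrolet_pois = pois_for_brand(rows, "Chevrolet")
```

A search on location_name for "MGM Resorts" would miss the Bellagio row, while the membership test on brands finds it and also handles multi-brand car dealers.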
Categorization of POI
SafeGraph Places uses the North American Industry Classification System (NAICS) developed by the US Census Bureau, which consists of a numeric NAICS code up to 6 digits in length. Although this taxonomy was developed in the US, we have found it just as useful for categorizing POIs in other countries as well and will continue to use it until a better alternative presents itself. We currently reference the 2017 version of NAICS. We will provide an update if and when we ultimately update to reflect the 2022 changes.
The NAICS code itself is hierarchical; in other words, the first 2 digits describe a very general category, and additional digits describe more and more specific categories. For example:
- 72 is the general category Accommodation and Food Services.
- 722 is the more specific category Food Services and Drinking Places.
- 7225 is the even more specific category Restaurants and Other Eating Places.
- 722513 is the most specific category Limited-Service Restaurants (i.e., quick-serve or fast-food restaurants).
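The prefix structure of a NAICS code lends itself to simple string slicing. The sketch below lists every prefix of a code from 2 digits upward, and derives the 4-digit and 6-digit keys that top_category and sub_category are described as labeling. The function names are hypothetical, and returning None when a code is too short is just this sketch's placeholder, not a statement about how the delivered columns are populated.

```python
def naics_prefixes(naics_code):
    """List the hierarchy of prefixes, from the 2-digit sector to the full code."""
    code = str(naics_code)
    return [code[:n] for n in range(2, len(code) + 1)]


def category_keys(naics_code):
    """Return (4-digit key, 6-digit key), with None where the code is too short."""
    code = str(naics_code)
    top = code[:4] if len(code) >= 4 else None
    sub = code if len(code) == 6 else None
    return top, sub


prefixes = naics_prefixes(722513)
```

Filtering on a prefix, for example every naics_code starting with "722", selects a whole branch of the hierarchy, including POIs that carry fewer than six digits.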
We strive to assign the best-fitting naics_code for all of our POIs. Our goal is to assign a full six digits for maximum granularity wherever possible, but our category algorithm cannot always infer a high-confidence six-digit naics_code based on POI name and other descriptive metadata. In these cases, we provide a shorter naics_code where we do have high confidence in the assignment (i.e., 3, 4, or 5 digits). In these circumstances, we choose to sacrifice the extra digits of precision in exchange for high-veracity predictions, and also because the extra precision is not always meaningfully different (i.e., some adjacent 6-digit NAICS codes are extremely similar).
See our Places Summary Statistics for the latest details on counts and coverage.
Also see our use of Category Tags to provide more flexibility and granularity where the NAICS code classification falls short.