About Dataset Records

In the GHDx, a record is a catalog entry for a particular dataset. A record consists of general metadata about the dataset, a citation and other source information, and information about where to obtain the dataset.

We define a dataset as a particular distribution or collection of data stemming from a single data collection, aggregation, or synthesis effort. A dataset may have one or more files and should include both data files and documentation explaining the individual variables as well as the data collection or synthesis methodology. The data in the dataset may stem from primary data collection (e.g., a survey) or from secondary data produced by aggregation or synthesis. For purposes of reference, datasets should be understood to have unique citations: if data from a survey are distributed separately by the different organizations conducting the survey, each distribution is a unique dataset.

A dataset may be the output of an ongoing data collection system, in which case the dataset is an extract of the system: a selection of data from the system based on some useful criteria. Most commonly, the extract is the data on a particular topic for a particular year, for example, the death data for 2011 from a vital registration system. See About Series and Systems for more information.

In general, we create one record per dataset. Because some datasets are broad or particularly complex, however, we may create multiple records for different components of a dataset in order to facilitate search, as in the case of the WHO Mortality Database.

Dataset Titles

Because datasets frequently have multiple names or lack official titles, we standardize all dataset titles in order to avoid creating duplicate entries and to improve search results. All titles (and record information) are in English, which prevents duplicate entries for the same dataset under both an English title and the title in the original language. As much as possible, our titles follow the format: Geography - Name - Years. This titling allows you to sort search results by title and group data from the same geographic area more easily. If a dataset has multiple versions created by different organizations, the secondary organization name generally appears at the end of the title, as in the IPUMS census series, where IPUMS has cleaned and harmonized census data from other organizations. We include the original title or titles in the metadata for the dataset.
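
As a rough illustration of the title format described above, the following Python sketch composes titles and sorts them. The function name make_title, the sample dataset names, and the way the secondary organization is appended with the same separator are all assumptions made for illustration; they are not taken from the GHDx itself.

```python
def make_title(geography, name, years, secondary_org=None):
    """Compose a title as 'Geography - Name - Years'.

    Hypothetical helper: appending the secondary organization with the
    same ' - ' separator is an assumption, not a documented GHDx rule.
    """
    parts = [geography, name, years]
    if secondary_org:
        parts.append(secondary_org)
    return " - ".join(parts)


# Invented sample records, for illustration only.
titles = [
    make_title("Kenya", "Example Health Survey", "2008-2009"),
    make_title("Brazil", "Example Census", "2010", secondary_org="IPUMS"),
    make_title("Brazil", "Example Census", "2000"),
]

# Because geography comes first, a plain alphabetical sort places
# records from the same geographic area next to each other.
sorted_titles = sorted(titles)
for title in sorted_titles:
    print(title)
```

Putting the geography first is what makes a simple alphabetical sort group records by area, with no separate grouping step needed.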
