Coverage Technical Report, Census of Population, 2021
8. Census Overcoverage Study
8.1 Overview
Overcoverage error occurs when in-scope individuals are enumerated more than once or when individuals who should not have been enumerated are included in the target population of a survey or a census. The purpose of the Census Overcoverage Study (COS) is to estimate the number of persons enumerated more than once in the Canadian Census of Population (Note 1).
The 2021 COS consisted of two types of linkages, namely deterministic and probabilistic. The deterministic linkage (DL) identified definite pairs of duplicate persons, meaning those persons were enumerated more than once and hence represent overcoverage. The methodology was based on a modification of the Automated Match Study (AMS), which had been used in previous census cycles to evaluate the COS. The probabilistic linkage (PL) identified possible pairs of duplicate persons and was based on the methods used in past cycles of the COS. The COS used data from the 2021 Census Response Database and administrative data from the Canadian Statistical Demographic Database provided by the Census Research section of the Statistical Integration Methods Division. The COS sampling frame was created in multiple steps and includes definite and possible pairs of duplicate persons identified with both the DL and PL, along with an extension of the sampling frame based on households. A sample of possible pairs of duplicate persons was drawn from the COS frame and sent for manual verification to determine whether the sampled pairs were indeed duplicate persons. With the result of the manual verification of the sampled pairs, and definite pairs of duplicate persons identified by the DL, an estimate of overcoverage was then obtained.
8.2 Linkage steps
8.2.1 Data used for the linkages
Two sources of data were used for the linkages.
Firstly, the Census Coverage Studies version of the Census Response Database (CCS-RDB, referred to as the RDB in this chapter) was a version of the Census Response Database that did not include late or incomplete enumerations, or persons added through the whole household imputation process. The RDB contained a little over 35 million records and included responses from individuals living in both private and collective dwellings. It contained names (including given names and surnames), demographic information (including date of birth and sex) and geographic information (including province or territory and postal code, and census geographic variables such as collection unit (CU), census subdivision (CSD) and census metropolitan area (CMA)).
Secondly, administrative (ADM) data were used, based on the Canadian Statistical Demographic Database provided by the Census Research section of the Statistical Integration Methods Division. They comprised records from multiple ADM data sources and aimed to represent persons in scope for the census. The ADM data consisted of around 53 million records. They included names (given names and surnames), demographic information (including date of birth and sex) and geographic information (including province or territory and postal code).
The following matching variables were used in the linkages (when applicable):
- names: given name(s) and surname(s) variables
- demographic data: date of birth and sex variables
- geographic data: province or territory and postal code, and census geographic variables.
8.2.2 Deterministic linkage
The purpose of the DL was to identify high-quality pairs of duplicate persons, consisting of two records from the RDB, which were classified as definite pairs of overcoverage. The deterministic matching programs traditionally used for the AMS were modified to include a comparison of names as part of the linkage criteria and to consider matches between a household living in a private dwelling and a household living in a collective dwelling.
The DL was based on the following series of operations:
- Deterministic matching programs were used to identify household pairs that were “similar.” Similarity was described in terms of the households' relative geographic proximity (households within the same CU, households in different CUs within the same CSD, etc.) and the number of persons matched between them. Persons were matched based on the variables of name, sex and date of birth. Two persons were said to be an exact match if they had the same sex, the same day, month and year of birth, and matching names. Two persons were said to be a near match if their names matched and either three of the four other components (sex and day, month and year of birth) agreed or just the day and month of birth were reversed. In each household pair, at least one of the two households lived in a private dwelling.
- An initial list of possible pairs of duplicate persons was created from household pairs.
- A verification sample was taken from the initial list of possible pairs of duplicate persons for manual verification purposes to confirm their high quality before classifying them as definite pairs of duplicate persons (i.e., overcoverage).
- A final list of pairs of duplicate persons was determined, and they were classified as definite pairs of duplicate persons resulting from the DL.
There were 460,572 definite pairs of duplicate persons resulting from the DL.
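The exact and near match rules described above can be sketched in code. This is a minimal illustration only, not the study's matching program: the `Person` record, the `names_match` placeholder (case-insensitive equality) and the reading of "day and month reversed" as requiring agreement on sex and year are all assumptions introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Person:
    name: str   # hypothetical: one standardized name string
    sex: str
    day: int
    month: int
    year: int


def names_match(a: Person, b: Person) -> bool:
    # Placeholder rule (assumption): the report does not state how names are compared.
    return a.name.strip().lower() == b.name.strip().lower()


def classify_match(a: Person, b: Person) -> str:
    """Return 'exact', 'near' or 'none' for two person records."""
    if not names_match(a, b):
        return "none"
    agree = [a.sex == b.sex, a.day == b.day, a.month == b.month, a.year == b.year]
    if all(agree):
        return "exact"
    # Assumed reading: sex and year agree, day and month are swapped.
    reversed_day_month = (
        a.sex == b.sex and a.year == b.year
        and a.day == b.month and a.month == b.day
    )
    if sum(agree) == 3 or reversed_day_month:
        return "near"
    return "none"
```

For example, two records with matching names, the same sex and year of birth, and birth dates of 3 July and 7 March would be classified as a near match under this sketch.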
8.2.3 Probabilistic linkage
The purpose of the PL was to identify possible pairs of duplicate persons. The PL consisted of an internal probabilistic record linkage of the entire RDB to itself, referred to as the
PL is conducted with G-Link, a probabilistic record linkage system designed at Statistics Canada that uses the Fellegi–Sunter method to solve large file linkage problems when there are no direct identifiers common to both sources (Fellegi and Sunter, 1969). As in past cycles, G-Link was used in 2021, and the following series of operations were done separately for the
The first task in a probabilistic linkage is to build a set of potential pairs (also known as a linked set), which is used to estimate the characteristics of the set of true matched pairs. To do this, a set of selection criteria was applied, which reduced the Cartesian product of all the possible matches to a more manageable comparison space. Improvements were made in the 2021 selection criteria to overcome challenges that arose with the 2016 selection criteria. In addition, rather than use identical selection criteria for both the internal and external linkages, criteria were developed, tested and optimized separately for these two linkages. Many of the RDB pairs derived from the
Once a linked set of pairs was obtained, the records of the pairs were compared by applying linkage rules in G-Link, which calculated the weights of the results of the linkage rules. Quality linkage rules that address all sets of characteristics for which two records agree were necessary to ensure the completeness of the COS sampling frame resulting from the PL. If some sets of characteristics are not addressed by the linkage rules, then pairs with such characteristics are likely to be assigned a lower linkage weight and to be rejected when thresholds are applied. Many improvements were made in the 2021 linkage rules to ensure that estimated linkage weights were well correlated with the likelihood of a pair being a true match. In 2021, more linkage variables were added to the rules, and the outcomes for existing rules that had been used in 2016 were modified, such as the rules on names. Outcomes based on census-specific geographic variables—such as the unique identifier of a dwelling (known as the FRAME_ID), CU and CSD—were added in 2021 and were applicable only to the
A linkage weight threshold for each province and territory was then established separately for the
8.3 Creation of the Census Overcoverage Study sampling frame
The COS sampling frame was created in multiple steps and included linked pairs identified with the DL and the PL, along with an extension of the sampling frame based on households. Sampling units were then created.
As previously described, the DL was used to identify a set of
As in previous cycles, the frame was then enriched with additional pairs not already identified by the PL but created from the households of pairs linked by the internal and external linkage steps. The purpose of this step was to identify additional possible pairs of duplicate persons in the households of captured pairs that may not have been caught with the PL, because the PL was based on comparisons of individuals rather than households. Potential pairs from this step were known as extension pairs and classified as possible pairs of duplicate persons. To construct the set of extension pairs, a household pair was first produced for each
The final linked set comprised pairs from the DL, extension pairs, pairs from the PL of the
Linkage type | Frequency | Percent |
---|---|---|
DL | 460,572 | 3.62 |
Extension | 471,688 | 3.71 |
RDB-ADM | 4,301,351 | 33.80 |
RDB-RDB | 7,491,998 | 58.87 |

DL = deterministic linkage; RDB-ADM = probabilistic linkage of Census Response Database to administrative data; RDB-RDB = probabilistic linkage of Census Response Database to itself.
Source: Statistics Canada, 2021 Census Overcoverage Study.
When pairs in the PL set were also found by the DL, the linkage type was set to DL. Then, the possible pairs of duplicate persons obtained from the DL, the PL and the extension were combined and deduplicated.
Since 2011, the COS has used interconnected record groups to estimate overcoverage in the census rather than record pairs. This is because overcoverage estimated by record pairs would be positively biased in the presence of triple or higher-order enumerations. Thus, mutually exclusive groups of connected RDB records were formed, where most of the groups of records on the frame resulted in one or two pairs (involving two or three records). For cases where the groups of records contained more than 10 links, a graph theory method was applied to reduce the group into small subgroups called “neighbourhoods” (Dasylva et al., 2015) to facilitate manual verification.
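The grouping idea can be illustrated with connected components on a graph whose nodes are RDB records and whose edges are linked pairs. The sketch below uses a small union-find structure. The record identifiers are hypothetical, and the neighbourhood reduction of Dasylva et al. (2015) is not reproduced here.

```python
from collections import defaultdict


def record_groups(pairs):
    """Form mutually exclusive groups of connected records from linked pairs."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    groups = defaultdict(set)
    for record in list(parent):
        groups[find(record)].add(record)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical example: one person enumerated three times, another twice.
pairs = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r4", "r5")]
groups = record_groups(pairs)

# If every record in a group were the same person (an assumption made only
# for this illustration), each group would contribute (size - 1) to overcoverage.
overcoverage_from_groups = sum(len(g) - 1 for g in groups)
```

In this example there are four linked pairs but only three excess enumerations, which shows why counting pairs overstates overcoverage when triple or higher-order enumerations are present.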
Lastly, the COS sampling frame consisted of three types of sampling units: pairs, groups and neighbourhoods. Sampling units were categorized by three process types: (1) DL-only, composed of pairs and groups or neighbourhoods of RDB records resulting from the DL; (2) PL-only, composed of pairs and groups or neighbourhoods of RDB records resulting from the
Sampling unit types | Process type: DL-only | Process type: PL-only | Process type: PL-DL | Total |
---|---|---|---|---|
Group | 4,822 | 1,635,296 | 86,930 | 1,727,048 |
Neighbourhood | 64 | 161,641 | 6,493 | 168,198 |
Pair | 345,243 | 5,931,084 | 0 | 6,276,327 |
Total | 350,129 | 7,728,021 | 93,423 | 8,171,573 |

DL = deterministic linkage; PL = probabilistic linkage; PL-DL = probabilistic linkage-deterministic linkage (some of the pairs in the group were identified by the probabilistic linkage only, while others were identified by the deterministic linkage).
Source: Statistics Canada, 2021 Census Overcoverage Study.
8.4 Sample design
The first level of stratification was by linkage process type, resulting in three strata:
- Stratum 1 consisted of DL pairs and groups or neighbourhoods made up of DL pairs only. This was treated as a take-all stratum, and sampling units in this stratum were classified as definite pairs of duplicate persons.
- Stratum 2 consisted of PL pairs and groups or neighbourhoods that contained only PL pairs. A probabilistic sample was drawn from this stratum, and the pairs were sent for manual verification.
- Stratum 3 consisted of groups or neighbourhoods that had a combination of PL and DL pairs. This stratum was further divided into two substrata. The first substratum was composed of groups and neighbourhoods that contained at least one DL pair that was sampled as part of the DL verification sample used to confirm the quality of these pairs. This substratum was treated as take-all. The second substratum was composed of groups and neighbourhoods that did not contain any DL pairs that were part of the DL verification sample. It had a probabilistic sample of PL–DL groups or neighbourhoods drawn from it. PL pairs in groups with DL pairs belonging to the first substratum were sent for manual verification, along with the PL and DL pairs selected from the second substratum.
The targeted sample size was approximately 55,000 pairs from the PL-only stratum and around 4,500 pairs from the PL–DL stratum. In this section, intraprovincial means all RDB records in a sampling unit are from the same province or territory, and interprovincial means RDB records in a sampling unit are from more than one province or territory. Tables in this section present counts of pairs whether the sampling unit is a pair, group or neighbourhood. Groups and neighbourhoods are broken down into their constituent pairs to derive the count of pairs. For simplicity, sampled pairs were sent for manual verification rather than groups of records.
For the PL-only stratum, the sampling unit type substrata were further stratified into 14 strata: 13 provincial strata containing sampling units (pairs or interconnected record groups or neighbourhoods) where all of the records belong to the same province or territory, and an interprovincial stratum where the sampling units have records from different provinces or territories. As in 2016, the interprovincial units may be groups that also contain some intraprovincial pairs. This was unavoidable when using interconnected record groups. To better control the sample size, the group and neighbourhood sampling units were further stratified by the number of pairs in the group. Finally, the sampling units were sorted by the estimated overcoverage propensity in the case of groups or neighbourhoods and by their conditional match probabilities in the case of pairs, and a systematic sample was then drawn (Note 2).
For the first PL–DL substratum, the DL pairs that were part of the verification sample had already been verified and so were not sent for manual verification. This was advantageous and allowed for a larger sample in the PL–DL substratum with at least one DL pair in the verification sample. The PL–DL groups for which none of the DL pairs were part of the verification sample were further stratified into 14 strata: 13 intraprovincial strata and an interprovincial stratum. As with the PL-only stratum, these 14 substrata were further stratified by the number of links to better control the sample size. The sampling units were then sorted by the estimated overcoverage propensity, and a systematic sample was drawn.
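The sort-then-systematic-sample step described above can be sketched as follows. This is a generic illustration under stated assumptions, not the study's sampling program: it uses a random start with a fixed (possibly fractional) skip interval, and the `key` function stands in for the estimated overcoverage propensity or the conditional match probability.

```python
import random


def systematic_sample(units, n, key, rng=None):
    """Sort units by `key`, then take every k-th unit from a random start."""
    rng = rng or random.Random()
    ordered = sorted(units, key=key)
    size = len(ordered)
    if n >= size:
        return ordered  # take-all when the requested sample covers the stratum
    step = size / n                 # fractional skip interval
    start = rng.random() * step     # random start within the first interval
    return [ordered[min(int(start + i * step), size - 1)] for i in range(n)]


# Hypothetical stratum of 100 pairs with made-up match probabilities.
stratum = [{"id": i, "match_prob": i / 100} for i in range(100)]
sample = systematic_sample(stratum, 10, key=lambda u: u["match_prob"],
                           rng=random.Random(1))
```

Sorting before sampling spreads the sample across the whole range of the sort variable, so low- and high-probability units are both represented.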
8.4.1 Deterministic linkage-only stratum
As mentioned above, the DL-only pairs and groups or neighbourhoods were considered definite matches and were not sent for manual verification. As shown in Table 8.3.2, there were fewer interconnected record groups among the DL pairs than among the PL pairs. In Table 8.4.1.1, which shows the breakdown of DL-only pairs by province or territory and interprovincial pairs, there were also fewer interprovincial DL-only pairs than interprovincial PL-only pairs (1.39% from Table 8.4.1.1 versus 18.34% from Table 8.4.2.2). This was what would be expected for pairs that were true matches.
Provinces and territories | Frequency | Percent |
---|---|---|
Newfoundland and Labrador | 5,133 | 1.42 |
Prince Edward Island | 1,678 | 0.47 |
Nova Scotia | 9,145 | 2.54 |
New Brunswick | 8,230 | 2.28 |
Quebec | 77,599 | 21.54 |
Ontario | 122,913 | 34.12 |
Manitoba | 11,908 | 3.31 |
Saskatchewan | 13,159 | 3.65 |
Alberta | 36,290 | 10.07 |
British Columbia | 67,876 | 18.84 |
Yukon | 463 | 0.13 |
Northwest Territories | 494 | 0.14 |
Nunavut | 391 | 0.11 |
Interprovincial | 5,001 | 1.39 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
8.4.2 Probabilistic linkage-only stratum
Table 8.4.2.1 shows the number of pairs for each sampling unit type and an estimate of the number of sampling units needed to obtain approximately that many pairs in the sample. The allocation to pairs and groups or neighbourhoods was proportional to size.
Sampling unit types | Number of pairs | Number of sampling units | Sample size (in terms of pairs) | Percent of sample (in terms of pairs) | Sample size (in terms of sampling units) |
---|---|---|---|---|---|
Group or neighbourhood | 6,411,761 | 1,796,937 | 28,110 | 52 | 9,599 |
Pair | 5,931,084 | 5,931,084 | 25,920 | 48 | 25,920 |
Total | 12,342,845 | 7,728,021 | 54,030 | 100 | 35,519 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
Probabilistic linkage-only pairs
The PL-only pairs were first stratified by intraprovincial and interprovincial pairs. Table 8.4.2.2 below gives the breakdown of intra- and interprovincial pairs among the PL-only pairs. Sample allocation to the intra- and interprovincial substrata was proportional to size.
Types of pairs | Frequency of pairs | Percent | Number of sampled pairs |
---|---|---|---|
Intraprovincial | 4,843,438 | 81.66 | 22,004 |
Interprovincial | 1,087,646 | 18.34 | 4,753 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
Within the intraprovincial pair stratum, a power allocation was used to allocate the PL-only pairs across provinces, with the measure of size taken to be the number of pairs in each province and q = ½. The pairs were then sorted by their conditional match probabilities, and a systematic sample was drawn. Note that the three territories were take-all. Table 8.4.2.3 shows the allocation of PL-only intraprovincial pairs by province or territory.
Provinces and territories | Frequency |
---|---|
Newfoundland and Labrador | 568 |
Prince Edward Island | 300 |
Nova Scotia | 760 |
New Brunswick | 748 |
Quebec | 6,915 |
Ontario | 5,600 |
Manitoba | 730 |
Saskatchewan | 701 |
Alberta | 1,740 |
British Columbia | 2,655 |
Yukon | 383 |
Northwest Territories | 448 |
Nunavut | 456 |
Total sample size | 22,004 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
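As an illustration of the power allocation mentioned above, the sketch below allocates a sample of size n across strata in proportion to N_h raised to the power q, with q = ½. The stratum sizes are hypothetical, the largest-remainder rounding is an assumption, and the take-all treatment of the territories is not modelled, so the published allocation cannot be reproduced from this sketch.

```python
def power_allocation(sizes, n, q=0.5):
    """Allocate n units across strata proportionally to size ** q."""
    powered = {h: size ** q for h, size in sizes.items()}
    total = sum(powered.values())
    raw = {h: n * value / total for h, value in powered.items()}
    alloc = {h: int(r) for h, r in raw.items()}
    # Largest-remainder rounding (assumption) so the allocation sums to n.
    leftover = n - sum(alloc.values())
    by_remainder = sorted(raw, key=lambda h: raw[h] - alloc[h], reverse=True)
    for h in by_remainder[:leftover]:
        alloc[h] += 1
    return alloc


# Hypothetical stratum sizes (number of pairs on the frame).
sizes = {"A": 10_000, "B": 2_500, "C": 100}
power = power_allocation(sizes, 160)              # q = 1/2
proportional = power_allocation(sizes, 160, q=1)  # q = 1 for comparison
```

With q = ½ the smallest stratum receives 10 of the 160 units, compared with 1 under proportional allocation, which is why a power allocation is often chosen when small provinces need a usable sample.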
The PL-only interprovincial pairs were further stratified by unique province combination and allocated proportional to size. There were 78 unique province combinations among the interprovincial pairs. Within the provincial combination substrata, pairs were sorted by their conditional match probabilities, and systematic sampling was used to draw the sample.
Probabilistic linkage-only groups and neighbourhoods
For the groups and neighbourhoods, the pairs were first stratified by intraprovincial and interprovincial groups. A group was considered interprovincial if it contained at least one interprovincial pair. Table 8.4.2.4 shows the breakdown of intra- and interprovincial groups in the PL-only stratum. The sample was allocated proportional to size between the intra- and interprovincial strata.
Group types | Frequency of pairs | Percent | Number of sampled pairs | Number of sampled groups |
---|---|---|---|---|
Intraprovincial | 4,070,750 | 63.49 | 17,970 | 6,656 |
Interprovincial | 2,341,011 | 36.51 | 10,140 | 3,022 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
Within the intraprovincial stratum, groups were allocated to provinces using a power allocation. Table 8.4.2.5 shows the allocation of PL-only intraprovincial sampling units by province or territory. As there were so few sampling units in the territories, these substrata were take-all. To better control the final sample size, the provincial strata were further stratified by group size in terms of the number of pairs in the group. The sample within each provincial stratum was allocated among group sizes proportional to the size. A minimum of one sampling unit was sampled within each stratum.
Group levels (provinces and territories) | Number of sampled pairs | Number of sampled groups |
---|---|---|
Newfoundland and Labrador | 270 | 112 |
Prince Edward Island | 100 | 43 |
Nova Scotia | 439 | 187 |
New Brunswick | 450 | 184 |
Quebec | 7,594 | 2,514 |
Ontario | 4,949 | 1,899 |
Manitoba | 385 | 162 |
Saskatchewan | 342 | 143 |
Alberta | 1,119 | 467 |
British Columbia | 2,123 | 866 |
Yukon | 54 | 24 |
Northwest Territories | 68 | 27 |
Nunavut | 64 | 28 |
Total sample size | 17,957 | 6,656 |

Note: The three territories are take-all strata.
Source: Statistics Canada, 2021 Census Overcoverage Study.
The interprovincial group and neighbourhood stratum was divided into two substrata: those with a majority province or territory (i.e., a province or territory to which most pairs in the group belong) and those without a majority province or territory (i.e., the pairs within the group are split evenly among the provinces or territories involved). The breakdown of pairs by majority and no majority groups and neighbourhoods is given in Table 8.4.2.6.
Group types | Frequency of pairs | Percent | Number of sampled pairs | Number of sampled groups |
---|---|---|---|---|
With a majority province or territory | 515,949 | 90 | 9,099 | 2,702 |
Without a majority province or territory | 57,923 | 10 | 1,041 | 422 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
Interprovincial groups and neighbourhoods with a majority province or territory were further stratified by dominant province or territory in the group and allocated using a power allocation. The sampling units within the provincial substrata were then stratified by the number of pairs in the groups. Allocation to group size was proportional to size. The sampling units were then sorted by expected overcoverage in the group and the proportion of intraprovincial pairs in the group, and a systematic sample was drawn. Because there were only 102 groups with a majority territory, these strata were take-all. A minimum of four sampling units was drawn from each of the other strata. Table 8.4.2.7 shows the allocation of PL-only interprovincial sampling units with a majority province or territory by majority province or territory.
Group levels (provinces and territories) | Number of sampled pairs | Number of sampled groups |
---|---|---|
Newfoundland and Labrador | 282 | 91 |
Prince Edward Island | 127 | 37 |
Nova Scotia | 481 | 156 |
New Brunswick | 433 | 144 |
Quebec | 2,000 | 540 |
Ontario | 2,593 | 685 |
Manitoba | 346 | 108 |
Saskatchewan | 264 | 99 |
Alberta | 898 | 292 |
British Columbia | 1,399 | 448 |
Yukon | 129 | 49 |
Northwest Territories | 102 | 39 |
Nunavut | 36 | 14 |
Total sample size | 9,090 | 2,702 |

Note: The three territories are take-all strata.
Source: Statistics Canada, 2021 Census Overcoverage Study.
Groups with no majority province or territory were stratified by group size, and the sample was allocated proportional to size. Table 8.4.2.8 shows the allocation of PL-only interprovincial groups with no dominant province or territory by group size.
Group size (number of pairs) | Number of sampled pairs | Number of sampled groups |
---|---|---|
2 | 548 | 274 |
3 | 372 | 124 |
4 | 36 | 9 |
5 | 55 | 11 |
6 | 6 | 1 |
7 | 7 | 1 |
8 | 8 | 1 |
9 | 9 | 1 |
Total sample size | 1,041 | 422 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
8.4.3 Probabilistic linkage–deterministic linkage stratum
The breakdown of PL pairs and DL pairs in the PL–DL groups and neighbourhoods is shown in Table 8.4.3.1.
Linked by | Frequency | Percent |
---|---|---|
PL | 250,270 | 70.64 |
DL | 104,020 | 29.36 |

PL = probabilistic linkage; DL = deterministic linkage.
Note: The term “probabilistic linkage-deterministic linkage” means some of the pairs in the group were identified by the probabilistic linkage only, while others were identified by the deterministic linkage.
Source: Statistics Canada, 2021 Census Overcoverage Study.
As previously mentioned, a sample of DL pairs was drawn during the DL step and sent for manual verification to evaluate the quality of DL pairs and ensure that all DL pairs could be classified as definite pairs of duplicate persons. This sample was referred to as the DL verification sample. To use the DL verification sample, groups to which these sampled pairs belonged were treated as take-all strata, and the corresponding PL pairs, and any corresponding DL pairs not part of the DL verification sample, were sent for manual verification.
There were 1,010 sampled DL pairs among the pairs in the PL–DL interconnected record groups. These pairs belonged to 929 groups. The breakdown of PL and DL pairs among these 929 groups is shown in Table 8.4.3.2.
Linked by | Frequency | Percent |
---|---|---|
PL | 2,553 | 71.31 |
DL | 1,027 | 28.69 |

PL = probabilistic linkage; DL = deterministic linkage.
Note: The term “probabilistic linkage-deterministic linkage” means some of the pairs in the group were identified by the probabilistic linkage only, while others were identified by the deterministic linkage.
Source: Statistics Canada, 2021 Census Overcoverage Study.
There were 17 DL pairs and 2,553 PL pairs sent for manual verification. The 1,010 DL pairs that were part of the DL verification sample had already been verified. Hence, they were not sent for manual verification.
An additional sample of 533 groups (1,930 pairs) was selected from the PL–DL stratum. The PL–DL stratum was stratified by group-level province or territory and group size, and the sample was selected so that the full PL–DL sample was approximately proportional to size. The pair-level provincial breakdown of the full PL–DL sample is given in Table 8.4.3.3.
Provinces and territories | Frequency | Percent |
---|---|---|
Newfoundland and Labrador | 62 | 1.13 |
Prince Edward Island | 32 | 0.58 |
Nova Scotia | 87 | 1.58 |
New Brunswick | 105 | 1.91 |
Quebec | 1,583 | 28.76 |
Ontario | 1,683 | 30.57 |
Manitoba | 91 | 1.65 |
Saskatchewan | 83 | 1.51 |
Alberta | 311 | 5.65 |
British Columbia | 714 | 12.97 |
Yukon | 15 | 0.27 |
Northwest Territories | 15 | 0.27 |
Nunavut | 9 | 0.16 |
Interprovincial | 715 | 12.99 |
Total | 5,505 | 100.00 |

Note: The term “probabilistic linkage-deterministic linkage” means some of the pairs in the group were identified by the probabilistic linkage only, while others were identified by the deterministic linkage.
Source: Statistics Canada, 2021 Census Overcoverage Study.
The provincial strata were further stratified by number of links, and a systematic sample was drawn.
8.4.4 Final sample sizes (by pairs)
Table 8.4.4.1 below shows the final sample sizes for the PL-only and PL–DL strata that were sent for manual verification. The DL-only stratum consisted of 360,280 pairs that were classified as definite pairs of duplicate persons.
Strata | Number of pairs by stratum | Number of sampled pairs sent for manual verification (after de-duplication for overlapping neighbourhoods) |
---|---|---|
PL-DL (without 1,010 DL pairs that were part of the DL verification sample) | 92,494 | 4,495 |
PL-only interprovincial pairs | 1,087,646 | 4,753 |
PL-only intraprovincial pairs | 4,844,635 | 22,004 |
PL-only intraterritorial groups (take-all) | 484 | 186 |
PL-only intraprovincial groups | 4,070,750 | 18,153 |
PL-only interterritorial groups with a majority territory (take-all) | 266 | 266 |
PL-only interprovincial groups with a majority province | 2,184,665 | 8,815 |
PL-only interprovincial groups with no majority province or territory | 156,079 | 1,040 |
Total size | 12,437,019 | 59,712 |

PL = probabilistic linkage; DL = deterministic linkage; PL-DL = probabilistic linkage-deterministic linkage (some of the pairs in the group were identified by the probabilistic linkage only, while others were identified by the deterministic linkage).
Source: Statistics Canada, 2021 Census Overcoverage Study.
8.5 Manual verification operation
The manual verification operation was a clerical operation and had several objectives:
- independently verify sampled pairs to determine whether they are overcoverage
- review the household members associated with the sampled pairs to potentially identify additional cases of overcoverage not on the COS frame
- code the potential cause of the overcoverage (i.e., overcoverage scenario).
Manual verification was done pair by pair. When a group or neighbourhood was sampled, all of the pairs that it contained were examined manually. However, coders were not provided the grouping information for the pairs in groups and neighbourhoods. Each pair was verified on its own. The pairs were examined only once, even if they belonged to more than one sampled neighbourhood.
The manual verification process consisted of a comprehensive examination of all available information on the RDB. As in 2016, it consisted of the following steps:
- comparing the sampled RDB persons based on the names, sex, birth date and relationships, as well as some additional information added in 2021
- comparing the RDB household members based on the same criteria
- weighing the evidence for or against overcoverage between two RDB persons and between two RDB households
- determining the overcoverage scenario if there was overcoverage (Table 8.5.1 provides a list of overcoverage scenario codes and their description).
Codes | Description |
---|---|
1.1 | Two different FRAME_IDs for the same household; same or similar address |
1.2 | Two different FRAME_IDs for the same household; different address |
2.1 | Child of parents in separate households |
2.2 | Child (age 0 to 17) with other relative(s) |
2.3 | Child (age 0 to 17) with other unrelated adult(s) |
3.1 | Student or young adult (age 18 to 24) newly away from home |
3.2 | Young adult (age 18 to 24) entering or leaving married or common law relationship |
3.3 | Young adult (age 18 to 24) with other relative(s) |
3.4 | Young adult (age 18 to 24) with other unrelated adult(s) |
4.1 | Adult (age 25 or older) newly away from home |
4.2 | Adult (age 25 or older) entering or leaving married or common law relationship |
4.3 | Adult (age 25 or older) with other relative(s) |
4.4 | Adult (age 25 or older) with other unrelated adult(s) |
5.1 | One household not a private dwelling |
6.1 | Intrahousehold overcoverage (same FRAME_ID) |
7.1 | Other |

FRAME_ID = unique household identifier.
Source: Statistics Canada, 2021 Census Overcoverage Study.
The sample was divided into batches of 500 household pairs (household A, household B). Each batch was assigned to a clerk (verifier), who examined and decided whether the selected person of household A was duplicated (overcovered) with the selected person of household B for each household pair in the batch. A selected pair of records was the sampled pair of interest. Furthermore, the verifier identified additional pairs of duplicate persons (if any) from each household pair and within each household.
When verifiers were uncertain of how to code a case, they were instructed to refer it to their supervisor, who in turn consulted with the Data Quality (DQ) team (a team of subject matter experts in the Coverage Measurement Section of the Statistical Integration Methods Division) or referred the case to the DQ team. In 2021, some complex sampled pairs were sent directly to the DQ team to verify. Complex sampled cases included
- intra-household cases, that is, when the pair is from within a single household (for example, the same person is listed twice)
- one-person households (when the pair comes from two different households, each of which has a household size of 1).
Experience from past cycles showed that these complex sampled cases required the expertise of the DQ team to code them properly. The DQ team was also able to consult additional sources of information to help make an accurate decision, such as the questionnaire data from the current or past census cycles and linkages conducted by the Social Data Linkage Environment (SDLE) team at Statistics Canada. All sampled cases had to be coded with certainty, as no non-response was permitted.
Confidence in the coded results was required for the manual operation since the results directly contributed to the estimate of overcoverage. Thus, a 100% verification was implemented. This means two different verifiers coded the same batch. Once a batch had been coded by two different verifiers, their results were compared. All coded fields were compared. If any of the coding did not match, then the case was sent to the DQ team to make an informed decision. The 100% verification strategy ensured high-quality coded results, and continuous feedback was also provided to the clerks throughout the manual verification operation.
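The double-coding comparison described above can be sketched as a field-by-field check of two verifiers' results. This is a simplified illustration: the field names (`is_overcoverage`, `scenario`) and the dictionary layout are hypothetical and stand in for whatever coded fields the operation actually recorded.

```python
def compare_codings(first, second, fields=("is_overcoverage", "scenario")):
    """Split a batch into agreed cases and cases to escalate for review."""
    agreed, escalate = {}, []
    for pair_id, code_a in first.items():
        code_b = second[pair_id]
        if all(code_a[f] == code_b[f] for f in fields):
            agreed[pair_id] = code_a
        else:
            escalate.append(pair_id)  # any mismatch goes to expert review
    return agreed, escalate


# Hypothetical batch of three sampled pairs coded by two verifiers.
verifier_1 = {
    "p1": {"is_overcoverage": True, "scenario": "2.1"},
    "p2": {"is_overcoverage": False, "scenario": None},
    "p3": {"is_overcoverage": True, "scenario": "3.1"},
}
verifier_2 = {
    "p1": {"is_overcoverage": True, "scenario": "2.1"},
    "p2": {"is_overcoverage": False, "scenario": None},
    "p3": {"is_overcoverage": True, "scenario": "4.1"},
}
agreed, escalate = compare_codings(verifier_1, verifier_2)
```

Here pair `p3` would be escalated because the two verifiers agree on overcoverage but disagree on the scenario code.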
8.6 Weighting and estimation
8.6.1 Weighting
The initial weight of a sampling unit was simply the inverse of its selection probability. The sampling units that were groups and neighbourhoods varied in terms of the number of pairs they contained. These units were stratified by the number of pairs during sampling to better control the final sample size. However, for the interprovincial groups and neighbourhoods, the weighted provincial or territorial counts may have differed from what was on the frame. Therefore, a calibration step was added to ensure correct representation of the number of pairs in each province and territory. The sampling weights of the interprovincial groups and neighbourhoods were calibrated so that the estimated number of intraprovincial and interprovincial pairs in each province and territory matched the corresponding frame counts. Statistics Canada’s Generalized Estimation System (G-EST) was used to perform the calibration. Table 8.6.1.1 shows the calibration factors for each province and territory.
Provinces and territories | Intraprovincial | Interprovincial |
---|---|---|
Newfoundland and Labrador | 0.76 | 0.74 |
Prince Edward Island | 0.69 | 1.30 |
Nova Scotia | 1.01 | 0.96 |
New Brunswick | 0.99 | 0.88 |
Quebec | 0.98 | 1.08 |
Ontario | 0.99 | 0.99 |
Manitoba | 1.36 | 1.07 |
Saskatchewan | 1.13 | 0.90 |
Alberta | 1.08 | 1.08 |
British Columbia | 1.07 | 0.97 |
Yukon | 1.47 | 1.21 |
Northwest Territories | 1.40 | 0.42 |
Nunavut | 1.22 | 2.80 |

Source: Statistics Canada, 2021 Census Overcoverage Study.
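The relationship between initial weights, frame counts and calibration factors can be illustrated with a simple ratio adjustment. This is a simplified sketch and not the G-EST procedure: it assumes each sampling unit belongs to a single calibration domain, and the domain labels, probabilities and counts are invented for illustration.

```python
def initial_weight(selection_probability):
    """Initial weight: the inverse of the unit's selection probability."""
    return 1.0 / selection_probability


def calibration_factors(units, frame_counts):
    """Ratio of the frame pair count to the weighted sample pair count, by domain."""
    weighted_pairs = {}
    for unit in units:
        domain = unit["domain"]
        contribution = unit["weight"] * unit["n_pairs"]
        weighted_pairs[domain] = weighted_pairs.get(domain, 0.0) + contribution
    return {d: frame_counts[d] / total for d, total in weighted_pairs.items()}


# Hypothetical sampled groups: domain, selection probability, number of pairs.
units = [
    {"domain": "X", "weight": initial_weight(0.5), "n_pairs": 3},
    {"domain": "X", "weight": initial_weight(0.25), "n_pairs": 2},
    {"domain": "Y", "weight": initial_weight(0.2), "n_pairs": 4},
]
frame_counts = {"X": 21, "Y": 15}  # hypothetical pair counts on the frame

factors = calibration_factors(units, frame_counts)
calibrated = [u["weight"] * factors[u["domain"]] for u in units]
print(factors)     # {'X': 1.5, 'Y': 0.75}
print(calibrated)  # [3.0, 6.0, 3.75]
```

A factor above 1 means the initial weights under-represented the domain's pairs relative to the frame, and a factor below 1 means they over-represented them, which is how the values in Table 8.6.1.1 can be read.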
During the manual verification operation, verifiers identified cases of overcoverage in the households of sampled pairs that were not covered by the COS frame, and these pairs of duplicate persons were referred to as additional pairs of overcoverage found during manual verification. This occurred when the differences between the two records were too great for the pair to have been captured by the linkage processes. For example, if there were multiple typos, errors or too many differences in the fields used during the linkage process, the overcoverage pair was not on the COS frame.
This situation is illustrated in Figure 1 below. The oval with a blue outline represents the COS frame, while the oval with a green outline represents the target frame, which includes a small number of pairs that could not be captured with the linkage processes (i.e., the unobserved part of the target frame). The solid yellow oval represents the selected sample, which includes sampled person pairs, while the solid red oval represents the verified sample, which includes sampled person pairs and their household members. There are no weights directly associated with those pairs in the solid red oval that fall outside the COS frame (i.e., a small portion of the solid red oval falls in the unobserved portion of the target frame). The Generalized Weight Share Method (GWSM) (Lavallée, 2007) was used to derive weights for these pairs from the weights of the sampled pairs through which they were indirectly sampled. Hence, all the additional pairs of overcoverage found during manual verification had a weight derived for them, and they were added to the sample for the purpose of estimation. This approach replaced the AMS-based adjustment, used since the 2006 COS, which accounted for overcoverage measured by the AMS outside the COS frame.
Description for Figure 1
This figure consists of four ovals. An oval with a green outline is the largest and represents the target frame = Census Overcoverage Study frame + unobserved part. An oval with a blue outline is within the oval with a green outline and represents the Census Overcoverage Study frame. A solid red oval represents the verified sample and is situated fully within the oval with a green outline, with a small part of it outside the oval with a blue outline (i.e., a small portion of the solid red oval falls in the unobserved portion of the target frame). A solid yellow oval represents the selected sample and is situated fully inside the solid red oval, the oval with a blue outline and the oval with a green outline.
Source: Statistics Canada, 2021 Census Overcoverage Study.
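The general idea of weight sharing can be sketched as below. This shows the textbook form of the GWSM under simplifying assumptions, not the exact COS implementation: the cluster is assumed to be the household pair in which an additional duplicate was found, every frame pair linked to that cluster counts as one link, and all identifiers and weights are hypothetical.

```python
def gwsm_weight(sampled_weights, linked_frame_pairs):
    """Share the weights of sampled frame pairs over all frame pairs linked
    to the same cluster (here, assumed to be a household pair).

    sampled_weights: frame pair ID -> sampling weight, for sampled pairs only
    linked_frame_pairs: IDs of every frame pair linked to the cluster
    """
    total_links = len(linked_frame_pairs)
    if total_links == 0:
        raise ValueError("cluster has no link to the frame; no weight can be derived")
    # Unsampled frame pairs contribute zero weight but still count as links.
    shared = sum(sampled_weights.get(pair_id, 0.0) for pair_id in linked_frame_pairs)
    return shared / total_links


# Hypothetical household pair linked to three frame pairs, two of them sampled.
sampled_weights = {"F1": 12.0, "F2": 6.0}
weight = gwsm_weight(sampled_weights, ["F1", "F2", "F3"])
print(weight)  # 6.0
```

Dividing by the total number of links, rather than by the number of sampled links, is what keeps the derived weight from overstating clusters that could have been reached through several frame pairs.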
There were some limitations associated with the way additional pairs of overcoverage were identified. Duplicated single-person households or duplicated persons whose other household members have nothing in common within the unobserved part of the target frame would not be captured by manually verifying all household members of a sampled pair. Thus, it is acknowledged that the 2021 COS may still not represent the entire target frame of duplicate persons in the census. This would have also been the case when using the AMS to adjust the COS in previous cycles. However, the unobserved portion of the target frame is expected to be extremely small.
8.6.2 Estimation
The results from the manual verification operation were processed to create overcoverage groups that were used for estimation. An overcoverage group consisted of all RDB records that were linked together by verified overcoverage. The COS estimates were based on the sum of the overcoverage counted in each overcoverage group. For an overcoverage group that was a pair, the overcoverage count was simply 1. If the overcoverage group was contained within a small group of records (i.e., a group not broken into neighbourhoods), then:
Overcoverage = number of records in overcoverage group – 1.
For overcoverage groups broken down into neighbourhoods, overcoverage was counted in the following two steps:
- Calculate overcoverage in each neighbourhood whose anchor (i.e., the RDB record acting as the centre of the neighbourhood) was involved in verified overcoverage for that overcoverage group as follows:
Overcoverage in the neighbourhood =
- Add up the neighbourhood overcoverage to obtain the total overcoverage in the overcoverage group.
Domain overcoverage was obtained by prorating the total pair, group or neighbourhood overcoverage by the proportion of RDB records in the given domain among those that belonged to the overcoverage group.
For interprovincial groups and neighbourhoods, the overcoverage calculated for a unit was multiplied by the calibrated weight to obtain the weighted estimate. Additional pairs of overcoverage found during manual verification were multiplied by their derived sampling weight from the use of the GWSM, to obtain the weighted estimate. Otherwise, the overcoverage calculated for a unit was multiplied by its initial sampling weight to obtain the weighted estimate. The variance of the estimate was calculated using G-EST.
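The counting, prorating and weighting rules for pairs and small groups can be sketched as follows. The sketch covers only groups not broken into neighbourhoods; the group structure, domain labels and weights are hypothetical, and the weight passed in stands for whichever weight applies to the unit (initial, calibrated or GWSM-derived).

```python
def group_overcoverage(n_records):
    """Overcoverage in a pair or small group: number of records minus 1."""
    if n_records < 2:
        raise ValueError("an overcoverage group needs at least two records")
    return n_records - 1


def domain_overcoverage(record_domains, domain):
    """Prorate group overcoverage by the share of the group's records in the domain."""
    total = group_overcoverage(len(record_domains))
    share = record_domains.count(domain) / len(record_domains)
    return total * share


def weighted_estimate(groups, domain):
    """Sum of weight times prorated overcoverage over all overcoverage groups."""
    return sum(g["weight"] * domain_overcoverage(g["domains"], domain) for g in groups)


# Hypothetical groups: the domain of each RDB record and the unit's weight.
groups = [
    {"domains": ["ON", "ON"], "weight": 10.0},       # a pair: overcoverage 1
    {"domains": ["ON", "ON", "QC"], "weight": 4.0},  # three records: overcoverage 2
]

ontario = weighted_estimate(groups, "ON")
quebec = weighted_estimate(groups, "QC")
print(round(ontario, 2), round(quebec, 2))  # 15.33 2.67
```

Because the domain shares within a group sum to 1, the domain estimates add up to the weighted total overcoverage (here 10 × 1 + 4 × 2 = 18).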
8.7 Results
The 2021 COS estimated that 755,635 persons were enumerated more than once in the 2021 Census of Population. The results were examined by each of the components that led to the construction of the sampling frame and its contribution to the overall estimation of census overcoverage. Potential reasons why persons were counted more than once in the census were also examined.
8.7.1 Overcoverage by component
Each case of overcoverage (definite or manually verified) was characterized by the COS components that identified the pairs in its sampling unit. They are of four types:
- DL-only: all the pairs in the overcoverage group were identified by the DL
- PL-only: all the pairs in the overcoverage group were identified only by the PL
- PL–DL: some of the pairs in the group were identified by the PL only, while others were identified by the DL
- overcoverage manual verification (OCMV): all the pairs in the overcoverage group were additional pairs of duplicate persons found during manual verification that were not on the COS sampling frame and for which an indirect sampling weight was derived using the GWSM.
It is important to remember that pairs identified by both the PL and DL steps were classified as DL, so the “DL-only” category includes all the groups that are made up only of pairs that were identified by the DL, even though some of those same pairs could also have been identified by the PL.
Table 8.7.1.1 presents the number of overcoverage cases estimated by each of the COS components and the percentage of the total estimated overcoverage that each represented, for Canada and for each province and territory.
Provinces and territories | DL-only: estimated number | DL-only: % of total | PL-only: estimated number | PL-only: % of total | PL-DL: estimated number | PL-DL: % of total | OCMV: estimated number | OCMV: % of total | Total: estimated number | Total: standard error |
---|---|---|---|---|---|---|---|---|---|---|
Canada | 352,059 | 46.6 | 318,459 | 42.1 | 81,172 | 10.7 | 3,946 | 0.5 | 755,635 | 9,648 |
Newfoundland and Labrador | 5,148 | 50.5 | 4,664 | 45.8 | 382 | 3.7 | 0 | 0.0 | 10,194 | 439 |
Prince Edward Island | 1,678 | 51.0 | 1,284 | 39.0 | 311 | 9.5 | 16 | 0.5 | 3,289 | 191 |
Nova Scotia | 9,200 | 47.6 | 8,412 | 43.5 | 1,639 | 8.5 | 94 | 0.5 | 19,344 | 736 |
New Brunswick | 8,079 | 48.8 | 6,890 | 41.7 | 1,440 | 8.7 | 132 | 0.8 | 16,541 | 641 |
Quebec | 76,760 | 42.1 | 80,126 | 43.9 | 25,242 | 13.8 | 385 | 0.2 | 182,513 | 5,915 |
Ontario | 120,765 | 44.7 | 118,619 | 43.9 | 29,382 | 10.9 | 1,334 | 0.5 | 270,100 | 6,888 |
Manitoba | 11,930 | 51.5 | 10,231 | 44.2 | 970 | 4.2 | 29 | 0.1 | 23,160 | 757 |
Saskatchewan | 13,210 | 54.7 | 9,501 | 39.3 | 1,194 | 4.9 | 258 | 1.1 | 24,163 | 689 |
Alberta | 36,902 | 47.3 | 35,085 | 44.9 | 5,527 | 7.1 | 570 | 0.7 | 78,084 | 2,736 |
British Columbia | 66,976 | 53.2 | 42,762 | 34.0 | 15,017 | 11.9 | 1,078 | 0.9 | 125,832 | 2,778 |
Yukon | 479 | 57.9 | 315 | 38.1 | 21 | 2.5 | 12 | 1.4 | 827 | 38 |
Northwest Territories | 508 | 60.6 | 293 | 35.0 | 25 | 3.0 | 12 | 1.4 | 837 | 15 |
Nunavut | 423 | 56.3 | 277 | 36.9 | 23 | 3.1 | 28 | 3.7 | 751 | 17 |

DL-only = all the pairs in the overcoverage group were identified by the deterministic linkage
PL-only = all the pairs in the overcoverage group were identified only by the probabilistic linkage
PL-DL = some of the pairs in the group were identified by the probabilistic linkage only, while others were identified by the deterministic linkage
OCMV = all the pairs in the overcoverage group were additional pairs of duplicate persons found during manual verification that were not on the Census Overcoverage Study sampling frame and for which an indirect sampling weight was derived using the generalized weight share method
Note: Coverage estimates may not necessarily add up to the totals because of rounding.
Source: Statistics Canada, 2021 Census Overcoverage Study.
At the national level, the DL-only and PL-only components represented 46.6% and 42.1%, respectively, of the total estimate of overcoverage, while the PL–DL component represented 10.7%, and the OCMV component accounted for 0.5%.
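The national percentages can be reproduced from the estimated counts in Table 8.7.1.1. The short check below uses only the published Canada-level figures; the variable names are illustrative.

```python
# Canada-level estimates from Table 8.7.1.1.
components = {
    "DL-only": 352_059,
    "PL-only": 318_459,
    "PL-DL": 81_172,
    "OCMV": 3_946,
}
total = 755_635

# Each component's share of the total estimated overcoverage, in percent.
shares = {name: round(100 * count / total, 1) for name, count in components.items()}
print(shares)

# The components sum to within one unit of the published total because each
# estimate is rounded separately, as the table note indicates.
rounding_gap = sum(components.values()) - total
print(rounding_gap)
```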
The DL-only contribution to the total provincial or territorial estimate was higher for the Northwest Territories (60.6%) and Yukon (57.9%) and lower for Ontario (44.7%) and Quebec (42.1%). The PL-only contribution to the total provincial or territorial estimate was higher for Newfoundland and Labrador (45.8%) and Alberta (44.9%) and lower for the Northwest Territories (35.0%) and British Columbia (34.0%). As for the PL–DL component, its contribution was higher in Quebec (13.8%) and British Columbia (11.9%) and lower for the three territories (ranging from 2.5% to 3.1%). Lastly, for the OCMV component, its contribution was higher for the three territories (ranging from 1.4% to 3.7%) and lower for Manitoba (0.1%) and Newfoundland and Labrador (0.0%), where no additional pairs of duplicate persons were identified during the manual verification operation that were not already on the COS sampling frame.
8.7.2 Overcoverage by scenario
Table 8.7.2.1 shows the estimated overcoverage by potential reason why the overcoverage occurred, called the overcoverage scenario, at the national and provincial and territorial levels for 2021. It is important to mention that these results are not comparable to the 2016 overcoverage results by scenario for two reasons:
- The overcoverage scenario was coded during the manual verification operation. Since the DL-only pairs were considered as definite pairs of duplicate persons without manual verification, an overcoverage scenario is not available for those pairs.
- The codes used for the scenarios were modified for the 2021 cycle to improve the consistency of the coding and the usefulness of the results.
Excluding the DL-only cases, almost 25% of all overcoverage at the national level is between two identical households. This proportion is a little lower for Newfoundland and Labrador and higher for British Columbia.
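The "almost 25%" figure can be approximated from the percentages in Table 8.7.2.1 by rescaling the identical-household share to the overcoverage that remains once the deterministic linkage column is removed. The sketch assumes that column 8.1 corresponds to the excluded DL-only cases; because the published percentages are rounded, the results are approximate.

```python
# (identical households %, deterministic linkage %) from Table 8.7.2.1.
table_rows = {
    "Canada": (12.5, 48.2),
    "Newfoundland and Labrador": (9.5, 49.8),
    "British Columbia": (14.6, 53.6),
}


def share_excluding_dl(identical_pct, dl_pct):
    """Identical-household share of overcoverage once DL cases are excluded."""
    return 100 * identical_pct / (100 - dl_pct)


rescaled = {
    region: round(share_excluding_dl(identical, dl), 1)
    for region, (identical, dl) in table_rows.items()
}
print(rescaled)
```

Under this assumption, the rescaled share is roughly 24% for Canada, about 19% for Newfoundland and Labrador and about 31% for British Columbia.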
When only overcoverage within non-identical households is considered and the DL-only cases are excluded again, the most frequent overcoverage scenario is a child enumerated by both parents in separate households, as was the case in 2016 and previous cycles. This is true for every province and territory, except for Nova Scotia and Nunavut. In Nova Scotia, the most frequent scenario was a student or young adult (age 18 to 24) newly away from home, while in Nunavut, it was a child
Overcoverage scenario (percent)

Provinces and territories | 1.1 Identical households | 2.1 Child of parents in separate households | 2.2 Child | 2.3 Child | 3.1 Student or young adult | 3.2 Young adult | 3.3 Young adult | 3.4 Young adult | 4.1 Adult | 4.2 Adult | 4.3 Adult | 4.4 Adult | 5.1 One household not a private dwelling | 6.1 Intrahousehold overcoverage (same FRAME_ID) | 7.1 Other | 8.1 Deterministic linkage |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Canada | 12.5 | 11.3 | 0.8 | 0.3 | 5.8 | 1.2 | 0.5 | 0.7 | 3.6 | 3.5 | 3.9 | 1.6 | 2.4 | 0.6 | 3.2 | 48.2 |
Newfoundland and Labrador | 9.5 | 11.5 | 1.2 | 0.0 | 7.7 | 2.5 | 0.3 | 0.0 | 1.6 | 4.9 | 3.6 | 1.0 | 3.0 | 0.3 | 3.3 | 49.8 |
Prince Edward Island | 11.2 | 11.5 | 0.3 | 0.4 | 9.0 | 1.1 | 0.8 | 0.3 | 3.7 | 1.8 | 1.2 | 1.3 | 2.5 | 0.0 | 0.8 | 54.0 |
Nova Scotia | 10.8 | 9.9 | 2.0 | 0.0 | 15.5 | 1.9 | 0.0 | 0.6 | 1.9 | 3.0 | 2.1 | 1.2 | 1.0 | 0.4 | 1.8 | 47.9 |
New Brunswick | 11.0 | 10.5 | 1.7 | 0.5 | 6.4 | 3.4 | 0.4 | 0.4 | 2.8 | 4.5 | 2.7 | 0.4 | 2.2 | 0.0 | 3.1 | 50.0 |
Quebec | 12.1 | 16.6 | 0.7 | 0.2 | 6.0 | 1.7 | 0.6 | 0.5 | 4.1 | 4.7 | 3.9 | 0.9 | 2.3 | 0.7 | 2.9 | 41.9 |
Ontario | 13.0 | 10.6 | 0.5 | 0.1 | 5.4 | 0.6 | 0.2 | 0.9 | 4.1 | 3.0 | 4.9 | 1.6 | 2.3 | 0.7 | 3.0 | 49.2 |
Manitoba | 10.5 | 8.8 | 2.4 | 0.5 | 5.1 | 1.7 | 1.1 | 0.9 | 3.2 | 3.5 | 2.4 | 1.8 | 4.1 | 0.3 | 2.5 | 51.2 |
Saskatchewan | 9.3 | 11.1 | 1.5 | 0.8 | 4.0 | 0.6 | 1.5 | 1.1 | 2.4 | 1.5 | 3.7 | 1.7 | 3.1 | 0.2 | 2.8 | 54.6 |
Alberta | 11.5 | 9.3 | 0.7 | 0.8 | 6.6 | 2.1 | 0.5 | 0.5 | 3.3 | 3.8 | 3.9 | 2.5 | 2.7 | 0.5 | 4.4 | 47.0 |
British Columbia | 14.6 | 7.0 | 0.7 | 0.5 | 4.7 | 0.5 | 0.5 | 0.9 | 3.0 | 2.8 | 2.5 | 2.1 | 2.3 | 0.4 | 4.0 | 53.6 |
Yukon | 9.2 | 13.1 | 0.4 | 0.2 | 4.0 | 0.8 | 0.8 | 0.3 | 0.8 | 3.1 | 1.9 | 1.4 | 1.4 | 0.1 | 3.0 | 59.5 |
Northwest Territories | 10.8 | 8.2 | 2.1 | 1.0 | 1.4 | 0.7 | 0.6 | 0.5 | 0.9 | 2.7 | 2.9 | 2.7 | 1.8 | 0.5 | 1.7 | 61.4 |
Nunavut | 13.6 | 4.5 | 8.8 | 1.2 | 1.6 | 1.1 | 2.7 | 0.7 | 1.2 | 0.8 | 4.8 | 1.2 | 3.2 | 0.4 | 2.3 | 51.7 |

FRAME_ID = unique household identifier
Note: Overcoverage by scenario is estimated at the pair level rather than the group level; hence, there is a small difference in the percentages when compared with Table 8.7.1.1.
Source: Statistics Canada, 2021 Census Overcoverage Study.