Guide to the Census of Population, 2021
Chapter 8 – Processing

Introduction

Processing, the phase that follows collection, began on April 26, 2021, with the editing and coding of responses for approximately 17 million private and collective dwellings.

Receipt and registration

For the 2021 Census, electronic responses from online questionnaires were received from the Collection Management Portal (CMP) and registered in the Census Processing System (CPS) hourly before entering the edit and coding workflow. The CPS also registered interviewer responses received through the Census Help Line, non-response follow-up (NRFU) and failed edit follow-up (FEFU) on a regular basis during collection and follow-up.

Paper questionnaires that were returned by mail were registered in Canada Post sorting plants by scanning the barcode on the front of the questionnaire visible through the return envelope window before delivery to the Data Operations Centre (DOC). To confirm receipt by Statistics Canada, the questionnaires were removed from the envelopes and registered again at the DOC using the manual Check-in Station(s). Whenever Canada Post was unable to read the barcodes (for instance, when questionnaires were inserted into envelopes backwards), the questionnaires were removed from the envelopes and the barcode scanned when the envelopes were delivered to Statistics Canada.

Registrations of all questionnaires from Canada Post were transmitted to the CMP on an hourly basis. Census employees were notified (via the CMP) of which questionnaires had been received so that they could stop contact for these respondents during NRFU procedures.

Paper questionnaires that were completed by census employees during NRFU were shipped by their supervisors (crew leaders) directly to the DOC, where they were registered. These questionnaires were then data captured in the same way as other paper responses.

Imaging and data capture

Once paper questionnaires were registered, the next step was document preparation and scanning for data capture of responses.

Steps

  1. Document preparation—Mailed-back questionnaires were removed from envelopes. In order to ensure that questionnaires were ready to be scanned, operators removed foreign objects such as clips and staples from the documents. Questionnaires were also cut into single sheets using guillotines (large paper cutters).
  2. Scanning—Scanning, using high-speed scanners, created digital images from the paper questionnaires.
  3. Automated image quality assurance—An automated system verified the quality of the scanning for capture purposes. Images failing this process were flagged at Document Analyst, and an operator made a determination on best action for the capture of the form.
  4. Automated data capture—Optical mark recognition and optical character recognition were used to extract respondent data. When the system could not recognize handwritten responses (known as write-ins), an operator keyed them from the scanned images. Paper questionnaires that could not be scanned (e.g., too damaged) or that were filled out with a pen or pencil that the automated capture systems could not read were sent for transcription (i.e., the data were transcribed to a new form).
  5. Check-out—This quality assurance process ensured that the questionnaire images and captured data were of sufficient quality and that the paper questionnaires were no longer required.
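
The routing decision in step 4 can be illustrated with a short Python sketch. It is not the census capture system: the `CapturedField` type, the `route_questionnaire` function and the 0.90 recognition threshold are all hypothetical and serve only to show how fields might be split between automated capture, operator keying and transcription.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical threshold; the guide does not publish the real value.
RECOGNITION_THRESHOLD = 0.90


@dataclass
class CapturedField:
    name: str          # question identifier on the form
    text: str          # best guess from mark or character recognition
    confidence: float  # recognition confidence between 0 and 1


def route_questionnaire(scannable: bool,
                        fields: List[CapturedField]) -> Dict[str, List[str]]:
    """Return, for one questionnaire, which fields go to which capture path."""
    routes: Dict[str, List[str]] = {"automated": [], "keying": [], "transcription": []}
    if not scannable:
        # Damaged or unreadable forms are transcribed to a new form as a whole.
        routes["transcription"] = [f.name for f in fields]
        return routes
    for f in fields:
        # Well-recognized fields are accepted; the rest are keyed by an operator.
        target = "automated" if f.confidence >= RECOGNITION_THRESHOLD else "keying"
        routes[target].append(f.name)
    return routes


fields = [CapturedField("name", "Ana", 0.99), CapturedField("occupation", "nrse", 0.40)]
print(route_questionnaire(True, fields))
```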

Edits

As the paper questionnaires were captured and the online questionnaires received, an interactive process of manual and automated edits was performed to ensure that problems and inconsistencies were identified and resolved.

  1. Blank and minimum content—This automated edit identified questionnaires with no information or insufficient information to continue processing. These cases were returned to the field for non-response follow-up by census employees.
  2. Multiple responses—A household may have multiple questionnaires (e.g., large households require more than one paper questionnaire to complete the census). This automated edit identified households with one or more missing questionnaires. These cases were held in a queue until all questionnaires were received.
  3. Coverage edits—These edits were conducted for both private and collective dwellings and ensured that the reported number of members of a household was consistent with the responses provided, including the number of names listed. Errors were resolved by an automated process or through interactive verification, in which DOC employees manually examined the captured data and scanned images (where available) to determine the appropriate solution.
  4. Failed edit follow-up—Short-form questions that needed further coverage or content clarification were transmitted to the Statistics Canada regional offices for failed edit follow-up collection and transmitted back to the Census Processing System for subsequent DOC processing.
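
As an illustration only, the first three edits can be sketched as a sequence of checks applied to the questionnaires received for one household. The `Questionnaire` fields, the `edit_household` function, the outcome labels and the rule used for "insufficient information" are assumptions made for this example; the actual edit rules are not described in this chapter.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Questionnaire:
    household_id: str
    sheet: int                      # which form this is within the household's set
    sheets_expected: int            # how many forms the household was expected to return
    reported_count: Optional[int]   # number of household members reported, if any
    names: List[str] = field(default_factory=list)


def edit_household(forms: List[Questionnaire]) -> str:
    """Apply the edits in order and return the first outcome that applies."""
    # 1. Blank and minimum content (assumed rule: no count and no names at all).
    if all(f.reported_count is None and not f.names for f in forms):
        return "non_response_follow_up"
    # 2. Multiple responses: hold the household until every form has arrived.
    expected = max(f.sheets_expected for f in forms)
    if len({f.sheet for f in forms}) < expected:
        return "hold_for_missing_questionnaire"
    # 3. Coverage: the reported count must agree with the number of names listed.
    reported = next((f.reported_count for f in forms if f.reported_count is not None), None)
    listed = sum(len(f.names) for f in forms)
    if reported != listed:
        return "coverage_review"
    return "pass"


print(edit_household([Questionnaire("H3", 1, 1, 3, ["Ana", "Ben"])]))
```

Returning the first failing edit mirrors the idea that a household cannot proceed to coverage checks until all of its questionnaires are present.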

Coding

Written responses to census questions were converted to numerical codes before they could be tabulated for analysis and release purposes. For the 2021 Census, all written responses on the questionnaires underwent automated and interactive coding to assign each one a numerical code using reference files, code sets and standard classifications.

The automated coding was completed using Statistics Canada’s Generalized Coding Tool (G-Code). A preprocessing step was first completed to prepare raw write-in text strings for autocoding. These text strings were then matched against reference files built by subject matter experts using actual responses from past censuses. Write-ins with an exact match on the reference file were assigned that code.

Remaining write-ins were then presented to a machine learning (ML) model that was trained using high-quality coded data and reference files agreed upon by subject matter experts and methodologists. The ML algorithm assigned each record a code and confidence score. Matches with confidence above identified thresholds were assigned that code.

Write-ins still without a code were sent to interactive coding applications where they were assigned codes by specially trained coding operators and subject matter experts.
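
The three coding stages described above (exact match, model with a confidence threshold, interactive coding) can be sketched as a simple cascade. This is not G-Code and does not use its interface: the reference entries, the placeholder codes, the 0.85 threshold and the `toy_model` function are all hypothetical. In particular, `toy_model` is a fuzzy string matcher standing in for the trained ML model, which the guide does not describe in detail.

```python
import re
from difflib import SequenceMatcher
from typing import Optional, Tuple

# Hypothetical reference file: cleaned write-in text mapped to placeholder codes.
REFERENCE = {"registered nurse": "A01", "truck driver": "B02"}


def preprocess(raw: str) -> str:
    """Lower-case, strip punctuation and collapse whitespace before matching."""
    text = re.sub(r"[^a-z0-9 ]+", " ", raw.lower())
    return " ".join(text.split())


def toy_model(text: str) -> Tuple[str, float]:
    """Stand-in for the ML model: returns (code, confidence) for the closest entry."""
    best = max(REFERENCE, key=lambda ref: SequenceMatcher(None, text, ref).ratio())
    return REFERENCE[best], SequenceMatcher(None, text, best).ratio()


def code_write_in(raw: str, threshold: float = 0.85) -> Tuple[Optional[str], str]:
    """Return (code, stage), where stage names the step that settled the write-in."""
    text = preprocess(raw)
    if text in REFERENCE:                # stage 1: exact match on the reference file
        return REFERENCE[text], "exact"
    code, confidence = toy_model(text)   # stage 2: model prediction with a score
    if confidence >= threshold:
        return code, "model"
    return None, "interactive"           # stage 3: left for coding operators


for write_in in ["  Registered  NURSE. ", "truck drivr", "astronaut"]:
    print(write_in, code_write_in(write_in))
```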

Subject matter experts then reviewed all coded records and certified these codes before delivery to edit and imputation.

Response database

Once data had successfully passed through each processing step at the Data Operations Centre, they were loaded into the response database (RDB).

The RDB is the microdata holding of all captured responses (paper and electronic questionnaires) during processing. The database has three categories of files:

The RDB is hosted in an Oracle environment that provides security features to ensure confidentiality and control accessibility and usage. Every user needs to be granted access through the Corporate Access Request System to be able to work with these data.

The RDB is a data repository whose primary purpose is to serve as an input to the edit and imputation database. It is also used for archival purposes, with a copy stored at Library and Archives Canada.

Edit and imputation

Data collected in any survey or census will contain invalid, inconsistent or missing responses. These errors can be the result of respondents missing or misunderstanding a question, or they can be generated during processing.

Edit and imputation activities begin once data capture, coverage edits and FEFU operations have ended and the RDB is deemed complete, consistent and as free of processing errors as possible. Edit and imputation is the last processing step before the census data are delivered for dissemination.

In the first phase, the census data from private households are run through the Whole Household Imputation (WHI), which resolves census total non-response before edit and imputation begins. Each of these dwellings is imputed as either occupied or unoccupied based on the Dwelling Classification Survey results, leading to the provision of population and dwelling counts to Statistics Canada’s Statistical Geomatics Centre. Besides the occupancy status, the WHI also imputes a household size, as well as a few demographic characteristics from administrative data if available, and searches for a donor household to donate its data for the remaining missing variables.

The second phase sees all data processed through a series of deterministic and donor modules for each topic, all run in a specific sequence using the Canadian Census Edit and Imputation System (CANCEIS). Modules detect and adjust for invalid, inconsistent or partial non-responses.

Deterministic imputation corrects systematic errors or errors that have only one solution based on subject matter experience. When many solutions are possible to solve an error, donor imputation is used. The latter method, also called nearest neighbour, is widely used in the treatment of non-response. It replaces missing, invalid or inconsistent information about one respondent with values from another, “similar” respondent. The rules for identifying the respondent most similar to the non-respondent may vary with the variables to be imputed. Donor imputation methods have good properties and generally will not alter the distribution of the data, a drawback of many other imputation techniques. Nearest neighbour imputation makes sure that any imputed value is consistent with the values of other variables.
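
A minimal sketch of the donor idea follows. It does not reproduce CANCEIS: the distance (a count of mismatched matching variables), the variable names and the `impute_from_donor` function are assumptions chosen to keep the example short.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def distance(a: Record, b: Record, matching_vars: List[str]) -> int:
    """Number of matching variables on which two records disagree."""
    return sum(a[v] != b[v] for v in matching_vars)


def impute_from_donor(recipient: Record, donors: List[Record],
                      matching_vars: List[str], target: str) -> Record:
    """Copy `target` from the most similar donor that reported a value for it."""
    candidates = [d for d in donors if d.get(target) is not None]
    donor = min(candidates, key=lambda d: distance(recipient, d, matching_vars))
    imputed = dict(recipient)           # leave the original record untouched
    imputed[target] = donor[target]
    imputed[target + "_imputed"] = "yes"  # quality flag recording the imputation
    return imputed


donors = [
    {"age_group": "25-34", "sex": "F", "region": "ON", "marital_status": "married"},
    {"age_group": "65+", "sex": "M", "region": "QC", "marital_status": "widowed"},
]
recipient = {"age_group": "65+", "sex": "M", "region": "ON", "marital_status": None}

print(impute_from_donor(recipient, donors, ["age_group", "sex", "region"], "marital_status"))
```

Because the value comes from a real respondent who resembles the recipient on the matching variables, the imputed value tends to stay consistent with the recipient's other characteristics.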

A few Structured Query Language (SQL) and Statistical Analysis System (SAS) modules are also part of the census edit and imputation processing flow.

Modules also generate a number of data quality flags, such as a non-response flag and imputation flag. These flags will be used at the estimation stage to derive various quality indicators for their corresponding census question.
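
As a simple illustration of how such flags could feed a quality indicator, the sketch below computes the share of records with a given flag set. The flag names and the `flag_rate` function are hypothetical, and the calculation is unweighted; the indicators actually published may be defined differently.

```python
from typing import Dict, List


def flag_rate(records: List[Dict[str, bool]], flag: str) -> float:
    """Unweighted share of records for which `flag` is set."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[flag]) / len(records)


# Hypothetical flags for one census question across four records.
records = [
    {"non_response": False, "imputed": False},
    {"non_response": True, "imputed": True},
    {"non_response": False, "imputed": True},   # invalid answer replaced
    {"non_response": False, "imputed": False},
]

print(flag_rate(records, "non_response"), flag_rate(records, "imputed"))
```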

For more information relating to the imputation rates, refer to the Data quality evaluation and Dissemination chapters.
