3.2.1.4 Identify concepts
Identify and clarify the concepts to be measured. At this point the producer
has to link and align users’ needs with existing statistical standards and
concepts.
3.2.1.5 Check data availability
Check whether data that could address the identified needs are already
available from any source. The aim is to minimize the cost of collecting and
producing new statistics. If partial data are available from an existing
source, new production should only complement what is missing from the
available data. It is important to conduct an exhaustive assessment of all
possible sources to identify any existing data before deciding to design a new
data collection. If some data are available from another organization, it is
better to set up legal arrangements for accessing those data, through data
sharing agreements or any other mechanism, to ensure that such data can be
accessed and utilized for statistics production.
3.2.1.6 Prepare and submit business case
Once the need for new data collection has been identified and existing data
from other sources have been reviewed and marked, the next step is to develop
and submit a business case. This may take the form of a complete proposal, a
concept note or just Terms of Reference. In whichever form, it should be
comprehensive and clearly identify the needs. The business case may include,
but is not limited to, the background, objectives, benefits, costs,
deliverables, time frame, budget, required technical and human resources, risk
assessment and impact on stakeholders for each option.
3.2.2 Design and build statistical program
This phase involves developing, designing, building and testing the statistics
production solution, which requires defining the statistical outputs,
concepts, methodologies, collection instruments and operational processes
proposed by the business case.
For statistical outputs produced on a regular basis, this phase usually occurs for the first iteration and whenever improvement actions are identified. This phase makes substantial use of international and national
standards and guidelines in order to reduce the length and cost of the design and
build process, and enhance the comparability and usability of outputs.
The following guidelines explain how this phase could be successfully implemented:
3.2.2.1 Design outputs
The outputs are designed based on the objectives of the data collection and
user needs. Statistics are normally produced through indicators; therefore, all
indicators intended to be produced need to be clearly identified together with
their associated metadata. Metadata enable the producer of statistics to
identify the variables that need to be included in the questionnaire or form
which will be used for data collection.
3.2.2.2 Design variable description
The next process is to design all variables that need to be collected according
to the indicators which need to be addressed. Consideration should also be given
to variables which will be computed from one or more variables that have
already been created. For example, if the interest is to collect age, the design
should provide a variable that allows both age groups and single-year age to be
produced. In such a case, the designed variable should be age in single years,
from which age groups can be derived as a computed variable. Some variables
may not be included as core variables but rather as classification or
disaggregation variables. These may include, for example, variables on sex,
geographical location, etc.
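The derivation of age groups from single-year age can be sketched as follows.
This is a minimal illustration in Python, assuming five-year groups with an
open-ended top group of 80 and over; the function name `age_group`, the group
width and the sample ages are illustrative choices, not prescribed by these
guidelines.

```python
def age_group(age, width=5, top=80):
    """Derive an age-group label from a single-year age.

    The five-year width and the open-ended 80+ group are assumptions
    made for illustration only.
    """
    if age < 0:
        raise ValueError("age cannot be negative")
    if age >= top:
        return f"{top}+"
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Collected variable: age in single years (made-up sample values).
ages = [0, 4, 5, 17, 34, 79, 80, 96]

# Computed variable: age group derived from the collected variable.
groups = [age_group(a) for a in ages]
```

Collecting age in single years keeps both outputs possible. The reverse is not
true: single years cannot be recovered once only age groups have been recorded.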
3.2.2.3 Design collection tool
Once all variables have been identified and well designed, a collection tool
(questionnaire or form) is designed. Before designing the questionnaire, the
response unit needs to be clearly identified. If more than one response unit is
targeted (for example, an establishment and an individual) then separate
questionnaires should be designed, one for each response unit. Try to avoid
mixing questions for two different units of response in one questionnaire. The
questionnaire or form designed in this process has to meet at least the
following characteristics:
i. Includes only necessary questions and avoids redundant questions;
ii. Has questions which are unambiguous and easily understood by the
respondent;
iii. Has questions which are properly sequenced;
iv. Has pre-coded questions as much as possible; and
v. Has questions whose response options are exhaustive.
The designed questionnaire or form may then be computerized so that it can be
administered using Computer Assisted Personal Interview (CAPI). A
comprehensive manual that explains each question in the form or questionnaire
should also be developed to assist in administering the form during data
collection and in processing and analysing the data collected.
3.2.2.4 Design processing and analysis
The next process is to design and build the statistical processing and
analysis methodology. This process includes:
i. Designing rules for coding;
ii. Designing rules for editing and imputation;
iii. Designing tabulation plan;
iv. Designing and building dummy tables; and
v. Designing report structure.
3.2.2.5 Design dissemination component
Designing and building dissemination products may involve designing a key
findings report, brochures and leaflets, dashboards, and video and audio clips.
Statistics to be released at the dissemination stage may be presented in
different forms, including tables, graphs, charts or maps. These should be
clearly identified, and any additional requirements for their preparation have
to be identified and designed at this stage.
3.2.2.6 Test production system
Once the design and building of the statistics production system have been
completed, the next stage is to test the system. Testing the system involves
testing the data collection instruments (questionnaires or forms) to see
whether they work and return responses as expected. Any discrepancy from the
expected results should be addressed and rectified at this stage. It is
important to pre-test the data collection instruments in the environment in
which they are going to be administered.
3.2.2.7 Test statistical business process
The next process is to pilot the complete statistical business process. This
goes beyond testing the instruments and covers all processes involved in the
production system, including logistics and administrative arrangements,
budgets and even the look of the final outputs.
3.2.2.8 Finalise production systems
Rectify any discrepancy between the pilot and the design. Changes should be
made wherever necessary, and everything has to be finalised in this process.
To complete this process, the following activities should be accomplished:
i. Documentation of all processes;
ii. Production of user manuals; and
iii. Training users of the various processes, including data collection,
editing, quality checks and administration of each process.
3.2.3 Data collection
Data collection of administrative records involves collecting or gathering all
necessary information (e.g. data, metadata and paradata) using different
collection modes (e.g. acquisition, collection, extraction, transfer), and
loading it into the appropriate environment for further processing. Whilst it
can include validation of data set formats, it does not include any
transformations of the data themselves, as these are all done in the "Process"
phase. For statistical outputs produced regularly, this phase occurs in each
iteration.
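The difference between validating the format of a data set and transforming
its content can be sketched as below. This is a minimal Python illustration,
assuming a CSV file received from a provider; the column names in
`EXPECTED_COLUMNS`, the function `validate_format` and the sample rows are
hypothetical. The check only reports problems and leaves the data untouched,
since transformations belong to the "Process" phase.

```python
import csv
import io

# Hypothetical file layout agreed with the data provider (an assumption).
EXPECTED_COLUMNS = ["record_id", "district", "sex", "age"]


def validate_format(csv_text, expected=EXPECTED_COLUMNS):
    """Return a list of format problems; the data are never modified."""
    problems = []
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]
    header = rows[0]
    if header != expected:
        problems.append(f"unexpected header: {header}")
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected):
            problems.append(
                f"line {line_no}: expected {len(expected)} fields, got {len(row)}"
            )
    return problems


# Made-up sample files: one well formed, one with a missing field.
good = "record_id,district,sex,age\n1,District A,F,34\n2,District B,M,41\n"
bad = "record_id,district,sex,age\n1,District A,F\n"

good_problems = validate_format(good)
bad_problems = validate_format(bad)
```

A file that fails such a check would typically be returned to the provider
rather than corrected at this stage.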
The data collection phase for administrative data is implemented through the
following three processes, which are generally sequential but can also occur
in parallel and can be iterative. These processes are set up collection,
conduct collection and finalise collection.
3.2.3.1 Set up collection
Setting up collection ensures that the people, processes and technology (e.g.
CAPI, web-based applications, GPS systems) are ready to collect data and
metadata in all modes as designed. It takes place over a period of time, as it
includes the strategy, planning and training activities in preparation for the
specific instance of the statistical business process. This process includes
the following sub-processes:
i. Preparing a collection strategy;
ii. Training staff who will fill in the forms on the administrative data
collection instruments and the system in general;
iii. Ensuring collection resources are available (e.g. laptops, collection
apps, APIs);
iv. Configuring collection systems to request and receive the data;
v. Ensuring the security of data to be collected;
vi. Preparing collection instruments (e.g. printing data collection forms,
pre-filling them with existing data, loading data collection forms and data
onto responsible staff computers, APIs, web scraping tools);
vii. Translating materials into Kiswahili if necessary; and
viii. Ensuring that the necessary processes, systems and confidentiality
procedures are in place to receive or extract the necessary information from
the source. This includes:
• Evaluating requests to acquire the data and logging each request in a
centralised inventory;
• Initiating contacts with organisations providing the data, and sending an
introductory package with details on the process of acquiring the data;
• Checking detailed information about files and metadata with the data
provider and receiving a test file to assess whether the data are fit for use;
and
• Arranging secure channels for the transmission of the data.
3.2.3.2 Conduct collection
The next process is the actual collection of data, using the collection
instruments to collect or gather the information, which may include raw
microdata or aggregates produced at the source, as well as any associated
metadata. It can include the initial contact with providers and any subsequent
follow-up or reminder actions. It may include manual data entry at the point
of contact, depending on the source and collection mode. It records when and
how providers were contacted, and whether they have responded. Depending on
the geographical frame and the technology used, geo-coding may need to be done
at the same time as collection of the data by using inputs from GPS systems,
putting a mark on a map, etc.
This process also includes the management of the providers involved in the
current collection, ensuring that the relationship between the statistical
organisation and data providers remains positive, and recording and responding
to comments, queries and complaints. Proper communication with reporting units
and minimization of the number of non-responses contribute significantly to a
higher quality of collected data.
For administrative data, geographical or other non-statistical data, the
provider is either contacted to send the information or sends it as scheduled.
This process may be time consuming and might require follow-ups to ensure that
data are provided according to the agreements. In cases where data are
published under an Open Data license and exist in machine-readable form, they
may be freely accessed and used.
This process also includes supervision and monitoring of data collection and
making any necessary changes to improve data quality.
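The record of when and how providers were contacted, and whether they have
responded, can be kept as a simple log that also supports monitoring. The
following is a minimal Python sketch; the provider names, dates, field names,
the 14-day grace period and the helpers `needs_follow_up` and `response_rate`
are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ContactRecord:
    provider: str
    contacted_on: date
    mode: str                            # how the provider was contacted
    responded_on: Optional[date] = None  # None until data are received


def needs_follow_up(log, today, grace_days=14):
    """Providers with no response after the grace period (assumed 14 days)."""
    return [
        r.provider
        for r in log
        if r.responded_on is None and (today - r.contacted_on).days > grace_days
    ]


def response_rate(log):
    """Share of contacted providers that have responded so far."""
    if not log:
        return 0.0
    return sum(r.responded_on is not None for r in log) / len(log)


# Made-up contact log for three providers.
log = [
    ContactRecord("Provider A", date(2024, 1, 8), "email", date(2024, 1, 15)),
    ContactRecord("Provider B", date(2024, 1, 8), "letter"),
    ContactRecord("Provider C", date(2024, 1, 29), "phone"),
]

pending = needs_follow_up(log, today=date(2024, 2, 1))
rate = response_rate(log)
```

Here only Provider B is flagged for a reminder, since Provider C was contacted
within the assumed grace period.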
3.2.3.3 Finalise collection
Finalise collection by loading the collected data and metadata into a suitable
electronic environment for further processing. This may include manual or
automatic data capture, for example, using clerical staff or optical character
recognition tools to extract information from data collection forms, or converting
the formats of files or encoding the variables received from other organisations.
In cases where there is a physical collection instrument, such as a data
collection form, which is not needed for further processing, this process
manages the archiving of that material. When the collection instruments use
software such as an API or an app, this process also includes their versioning
and archiving.
3.2.4 Data processing
The data processing phase describes the processing of input data and their preparation
for analysis. It is made up of sub-processes that integrate, classify, check, clean, and
transform input data, so that they can be analysed and disseminated as statistical
outputs. For statistical outputs produced regularly, this phase occurs in each
iteration. The following are processes which guide and can be applied to data from
administrative sources:
3.2.4.1 Integrate data
Data integration can involve one or more sources. This is where the results of
the sub-processes in the collection phase are combined. The input data can come
from a mixture of external or internal data sources and a variety of collection
modes, including extracts of administrative data, resulting in linked data.
Data integration includes matching data from multiple sources and prioritising
among them when two or more sources contain data for the same variable with
potentially different values. Following integration, depending on data protection
requirements, data may be anonymised, that is stripped of identifiers such as
name and address to help protect confidentiality.
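Matching, prioritising and anonymising can be sketched as follows. This is a
minimal Python illustration, assuming records are matched on a shared
identifier and that one source is treated as more reliable than the other; the
source names, field names, sample records and priority order are hypothetical.

```python
# Made-up extracts from two administrative sources, keyed by a shared ID.
civil_registry = {
    "P001": {"name": "Person One", "address": "Street 1", "age": 34, "sex": "F"},
    "P002": {"name": "Person Two", "address": "Street 2", "age": None, "sex": "M"},
}
health_records = {
    "P001": {"name": "Person One", "address": "Street 1", "age": 35, "sex": "F"},
    "P002": {"name": "Person Two", "address": "Street 2", "age": 41, "sex": "M"},
    "P003": {"name": "Person Three", "address": "Street 3", "age": 27, "sex": "F"},
}

# Sources listed from highest to lowest priority (an assumed ordering).
SOURCES_BY_PRIORITY = [civil_registry, health_records]
IDENTIFIERS = {"name", "address"}


def integrate(sources):
    """Match records on ID; for each variable keep the first non-missing
    value found when scanning the sources in priority order."""
    merged = {}
    for source in sources:
        for record_id, record in source.items():
            target = merged.setdefault(record_id, {})
            for variable, value in record.items():
                if target.get(variable) is None and value is not None:
                    target[variable] = value
    return merged


def anonymise(records, identifiers=IDENTIFIERS):
    """Strip direct identifiers such as name and address."""
    return {
        record_id: {k: v for k, v in record.items() if k not in identifiers}
        for record_id, record in records.items()
    }


linked = integrate(SOURCES_BY_PRIORITY)
anonymised = anonymise(linked)
```

For record P001 the two sources disagree on age, and the higher-priority
source wins. For P002 the higher-priority value is missing, so the value from
the second source is used. The record key is kept here only to tell records
apart; in practice it may itself need to be replaced by a non-identifying key.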
3.2.4.2 Classify and code
This process covers the classification, coding and imputation of data. Coding
should be done either during the design of the data collection form or after
collection, using an automated process or an interactive manual process. For
example, automatic