ACREC Lecture
Vilnius University
2026-01-24
Why Big Data?
Characteristics of Big Data:
| | Surveys | Big Data |
|---|---|---|
| Observations | Sample (n) | Entire Population (N) |
| Sources | Single questionnaire | Multiple sources |
| Data collection | Structured interviews (humans) | Algorithms (computers) |
| Data type | Subjective (self-reflective) | Objective (behavioral) |
| Data generation | Controlled (for evaluation) | Heterogeneous (for other purposes) |
| Data analysis | Traditional statistics | Traditional statistics and machine learning |
| Validity | Sampling error | Predictive power |
Source: “Introduction to Big Data for Social Science Research” (2019)
Big Data is often generated for a purpose other than research or evaluation. Therefore it suffers from:
The cost of collecting data has dropped dramatically. However, there is still a high cost in structuring big datasets.
Often, we will find data in a very messy format, which requires us to clean and organize it.
This may include manual verification, cross-data-set comparisons, statistical testing, and data engineering.
We can (and should) use a combination of both traditional statistics and machine learning approaches.
Examples
Methods
Violations of integrity, fraud, and corruption result in reduced quality, affordability and availability of public services.
There is a persistent need to proactively and systematically identify, precisely and comprehensively measure, and effectively mitigate corruption.
Big Data methods allow researchers and policy analysts to use administrative data to develop granular, scalable, and actionable corruption indicators.
The aim of institutionalised corruption is to steer the contract to the favoured bidder without detection in a recurrent and organised fashion.
Corruption in public procurement requires at least two violations of principles of fair distribution of public resources:
Public procurement data sets are structured collections of data that capture the details of contracts awarded by government agencies or departments to private entities.
These data sets are designed to ensure transparency, accountability, and efficiency in the use of public funds. Procurement data is also often used to measure corruption risks across various sectors (Fazekas and Hernández Sánchez 2021; Czibik et al. 2021).

The Water Integrity Risk Index (WIRI) combines procurement data indicators with survey-based measures of direct corruption experience to identify and assess integrity risks in the urban water and sanitation (W&S) sector.
It is a micro-level analysis that focuses narrowly on the W&S sector at the level of settlements, and it is both transparent and replicable.
We identify three main pillars of integrity in the W&S sector:
Given that integrity is a latent variable, we must rely on proxy indicators which can, in conjunction, reveal integrity risks. We calculate the composite WIRI with the following steps:
We assign each public procurement contract to one of the 3 pillars using product codes (e.g. CPV) or descriptions.
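As a rough illustration of this assignment step, the sketch below matches a contract's product code against a prefix table and falls back to keywords in the description. The prefixes, keywords, and pillar labels (`pillar_1` to `pillar_3`) are placeholders invented for the example, not the actual WIRI mapping, and `assign_pillar` is a hypothetical helper name.

```python
# Placeholder lookup tables: the real code prefixes, keywords, and
# pillar names are NOT given here and must come from the WIRI methodology.
PREFIX_TO_PILLAR = {
    "111": "pillar_1",
    "222": "pillar_2",
    "333": "pillar_3",
}
KEYWORD_TO_PILLAR = {
    "keyword_a": "pillar_1",
    "keyword_b": "pillar_2",
    "keyword_c": "pillar_3",
}


def assign_pillar(product_code, description=""):
    """Return a pillar label for one contract, or None if nothing matches."""
    code = (product_code or "").strip()
    # Try the longest prefixes first so that more specific codes win.
    for prefix in sorted(PREFIX_TO_PILLAR, key=len, reverse=True):
        if code.startswith(prefix):
            return PREFIX_TO_PILLAR[prefix]
    # Fall back to a simple keyword search in the contract description.
    text = (description or "").lower()
    for keyword, pillar in KEYWORD_TO_PILLAR.items():
        if keyword in text:
            return pillar
    return None
```

Contracts that return `None` would need manual review or a richer text classifier; the fallback order (code first, description second) is a design choice of this sketch.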
The public procurement indicator is a composite score of five elementary risk metrics:
We employ survey data to construct the indicator on direct bribery experiences in the W&S sector.
For each available survey, we calculate the bribery rate by dividing the number of respondents who admitted to bribery by the total number of respondents who required or requested a W&S service in a settlement.
For example, in Kenya, we rely on the Afrobarometer, counting positive responses from a representative sample of the population of settlements in the country who admit to paying bribes to obtain water services.
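The bribery-rate calculation described above can be sketched in a few lines. This is a minimal, unweighted version: the field names (`settlement`, `used_service`, `paid_bribe`) and the toy records are assumptions made for illustration, not the layout of any actual survey file, and survey weights are ignored.

```python
from collections import defaultdict


def bribery_rate_by_settlement(responses):
    """Share of service users who admitted bribery, per settlement.

    `responses` is an iterable of dicts with the assumed keys
    'settlement', 'used_service' (bool) and 'paid_bribe' (bool).
    """
    users = defaultdict(int)    # respondents who required or requested a service
    bribers = defaultdict(int)  # of those, respondents who admitted bribery
    for r in responses:
        if not r["used_service"]:
            continue  # non-users are excluded from the denominator
        users[r["settlement"]] += 1
        if r["paid_bribe"]:
            bribers[r["settlement"]] += 1
    return {s: bribers[s] / n for s, n in users.items()}


# Toy records, made up for the example.
responses = [
    {"settlement": "A", "used_service": True, "paid_bribe": True},
    {"settlement": "A", "used_service": True, "paid_bribe": False},
    {"settlement": "A", "used_service": False, "paid_bribe": False},
    {"settlement": "B", "used_service": True, "paid_bribe": False},
]
rates = bribery_rate_by_settlement(responses)
```

Settlements with no service users are left out of the result rather than assigned a rate of zero, since the rate is undefined when the denominator is empty.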
