Purpose
This vignette details the structure, preparation, and simulation of datasets included in the yusufHAIGermany package. It describes the original source tables, the cleaning steps carried out, and the creation of harmonised monthly, weekly, and daily series used throughout the analysis vignettes. The aim is to provide transparency about data origins and to facilitate reproducibility.
Original Dataset
The original dataset was manually collected from the GHAI (Global Healthcare-Associated Infections) journal report for Germany (2011–2012). The report offers annual totals for five major infection categories: HAP, SSI, BSI, UTI, and CDI. Values include cases, deaths, and DALYs, with details by region (Germany versus EU/EEA) and sex where available.
These extracted values serve as the foundation for the simulation framework. Since the journal reports only aggregate annual totals, additional temporal detail had to be generated through synthetic, rule-based simulation.
library(tibble)  # provides tribble()
library(knitr)   # provides kable()

orig_cols <- tribble(
~Variable, ~Description, ~Type,
"year", "Calendar year of report (2011–2012)", "integer",
"region", "EU/EEA or Germany", "character",
"hai", "HAI category (HAP, SSI, BSI, UTI, CDI)", "character",
"sex", "Male or Female (if reported)", "character",
"cases", "Annual total cases", "numeric",
"deaths", "Annual total deaths", "numeric",
"dalys", "Annual DALYs", "numeric"
)
kable(orig_cols, caption = "Table 1: Variables in the Original Extracted Dataset")

| Variable | Description | Type |
|---|---|---|
| year | Calendar year of report (2011–2012) | integer |
| region | EU/EEA or Germany | character |
| hai | HAI category (HAP, SSI, BSI, UTI, CDI) | character |
| sex | Male or Female (if reported) | character |
| cases | Annual total cases | numeric |
| deaths | Annual total deaths | numeric |
| dalys | Annual DALYs | numeric |
Cleaning and Simulation Process
- Standardisation of category names: Infection types were harmonised across all source files.
- Consistency checks: Totals were validated to ensure that no missing values or structural inconsistencies existed.
- Creation of temporal structure: Because only annual values were available, synthetic frequency-specific series were generated. Monthly values are allocated proportionally using an infection-specific seasonal pattern, weekly values are distributed using the ISO week structure, and daily values are generated by adding low-variance random noise around the weekly means.
- Reproducible simulation rules:
  - UTI was assigned strong seasonality (higher in summer).
  - HAP and SSI were assigned mild winter peaks.
  - CDI was assigned mild autumn variation.
  - Random noise was applied with fixed seeds.
All simulation scripts are included in the package to maintain reproducibility.
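As an illustration of the first two cleaning steps (name standardisation and consistency checks), the following is a minimal sketch, not the package's own script. The input values are placeholders rather than journal figures, and the `hai_lookup` mapping, the helpers `standardise_hai()` and `check_totals()`, and the specific checks (including `deaths <= cases`) are assumptions made for this example.

```r
# Hypothetical raw extract; the numbers are placeholders, not journal values.
raw <- data.frame(
  year   = c(2011L, 2011L, 2011L),
  region = c("Germany", "Germany", "Germany"),
  hai    = c("uti ", "Hap", "C. difficile"),
  cases  = c(100, 80, 30),
  deaths = c(5, 10, 2),
  dalys  = c(200, 400, 90),
  stringsAsFactors = FALSE
)

# Assumed mapping from free-text labels to the five harmonised codes.
hai_lookup <- c(
  "HAP" = "HAP", "SSI" = "SSI", "BSI" = "BSI", "UTI" = "UTI", "CDI" = "CDI",
  "C. DIFFICILE" = "CDI"
)

# Trim, upper-case, and map labels; stop on anything unmapped.
standardise_hai <- function(x) {
  key <- toupper(trimws(x))
  out <- unname(hai_lookup[key])
  if (anyNA(out)) {
    stop("Unmapped HAI label: ", paste(unique(x[is.na(out)]), collapse = ", "))
  }
  out
}

# Basic structural checks: no missing or negative values, deaths not above
# cases, and one row per year/region/category combination.
check_totals <- function(df) {
  num_cols <- c("cases", "deaths", "dalys")
  stopifnot(
    !any(is.na(df[num_cols])),
    all(df[num_cols] >= 0),
    all(df$deaths <= df$cases),
    !anyDuplicated(df[c("year", "region", "hai")])
  )
  invisible(df)
}

clean <- raw
clean$hai <- standardise_hai(raw$hai)
check_totals(clean)
```

Failing loudly on an unmapped label, instead of silently passing it through, keeps a misspelt category from creating a sixth infection group later in the pipeline.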
Simulation Output Table
The simulation generated three consistent, tidy time-series datasets: monthly, weekly, and daily. Each includes harmonised fields for year, date, infection type, counts, and burden metrics.
Monthly Simulated Data
Synthetic monthly distributions were created for each HAI group so that they follow the assigned seasonal pattern while summing exactly to the journal’s annual totals.
monthly_cols <- tribble(
~Variable, ~Description, ~Type,
"date", "First day of each month", "Date",
"year", "Calendar year", "integer",
"month", "Month number (1–12)", "integer",
"hai", "HAI category", "character",
"region", "EU/EEA or Germany", "character",
"sex", "Male or Female (if applicable)", "character",
"cases_month", "Simulated monthly cases", "numeric",
"deaths_month", "Simulated monthly deaths", "numeric",
"dalys_month", "Simulated monthly DALYs", "numeric"
)
kable(monthly_cols, caption = "Table 2: Variables in the Monthly Simulated Dataset")

| Variable | Description | Type |
|---|---|---|
| date | First day of each month | Date |
| year | Calendar year | integer |
| month | Month number (1–12) | integer |
| hai | HAI category | character |
| region | EU/EEA or Germany | character |
| sex | Male or Female (if applicable) | character |
| cases_month | Simulated monthly cases | numeric |
| deaths_month | Simulated monthly deaths | numeric |
| dalys_month | Simulated monthly DALYs | numeric |
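The proportional monthly allocation described in this section could be sketched as follows. This is one possible approach under stated assumptions, not the package's actual script: the cosine-shaped weights, the `peak_month` and `amplitude` values, the helpers `seasonal_weights()` and `allocate_integer()`, and the annual total of 1000 are all hypothetical. Largest-remainder rounding is used here so that integer monthly counts still add up to the annual total.

```r
# Seasonal weights for 12 months: a cosine bump centred on `peak_month`.
# `amplitude` in [0, 1) controls how strong the seasonality is.
seasonal_weights <- function(peak_month, amplitude) {
  m <- 1:12
  w <- 1 + amplitude * cos(2 * pi * (m - peak_month) / 12)
  w / sum(w)  # normalise so the weights sum to 1
}

# Split an integer total into integer parts that sum exactly to it
# (largest-remainder rounding).
allocate_integer <- function(total, weights) {
  raw  <- total * weights
  base <- floor(raw)
  left <- total - sum(base)  # units still to hand out after flooring
  if (left > 0) {
    top <- order(raw - base, decreasing = TRUE)[seq_len(left)]
    base[top] <- base[top] + 1
  }
  base
}

w_uti <- seasonal_weights(peak_month = 7, amplitude = 0.30)  # strong, summer
w_hap <- seasonal_weights(peak_month = 1, amplitude = 0.10)  # mild, winter

annual_cases <- 1000  # placeholder annual total, not a journal value
monthly <- data.frame(
  date        = seq(as.Date("2011-01-01"), by = "month", length.out = 12),
  year        = 2011L,
  month       = 1:12,
  hai         = "UTI",
  cases_month = allocate_integer(annual_cases, w_uti)
)

# The monthly series reproduces the annual total exactly.
stopifnot(sum(monthly$cases_month) == annual_cases)
```

For non-integer quantities such as DALYs, multiplying the annual value by the normalised weights is enough, since no rounding step is involved.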
Weekly Simulated Data
Weekly counts were calculated using the ISO 8601 week standard to ensure epidemiological consistency. The weekly figures are constructed to sum to the monthly and yearly totals.
weekly_cols <- tribble(
~Variable, ~Description, ~Type,
"date", "Start date of ISO week", "Date",
"year", "Calendar year", "integer",
"iso_week", "ISO week number (1–52/53)", "integer",
"hai", "HAI category", "character",
"region", "EU/EEA or Germany", "character",
"sex", "Male or Female", "character",
"cases_week", "Simulated weekly cases", "numeric",
"deaths_week", "Simulated weekly deaths", "numeric",
"dalys_week", "Simulated weekly DALYs", "numeric"
)
kable(weekly_cols, caption = "Table 3: Variables in the Weekly Simulated Dataset")

| Variable | Description | Type |
|---|---|---|
| date | Start date of ISO week | Date |
| year | Calendar year | integer |
| iso_week | ISO week number (1–52/53) | integer |
| hai | HAI category | character |
| region | EU/EEA or Germany | character |
| sex | Male or Female | character |
| cases_week | Simulated weekly cases | numeric |
| deaths_week | Simulated weekly deaths | numeric |
| dalys_week | Simulated weekly DALYs | numeric |
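One way to move from monthly totals to ISO weeks is sketched below. It is an illustration under assumptions and may differ from the package's scripts: the monthly values are placeholders, and spreading each month evenly over its days before summing by ISO week is a choice made for this example. Because ISO weeks can straddle month and year boundaries, this approach preserves the overall total, while a week that spans two months draws cases from both.

```r
# Placeholder monthly totals for one HAI category in 2011 (not journal values).
monthly_cases <- c(70, 72, 76, 82, 88, 92, 94, 92, 88, 82, 76, 72)

days <- data.frame(
  date = seq(as.Date("2011-01-01"), as.Date("2011-12-31"), by = "day")
)
days$month <- as.integer(format(days$date, "%m"))

# Spread each monthly total evenly over the days of that month.
days_in_month <- tabulate(days$month, nbins = 12)
days$cases <- monthly_cases[days$month] / days_in_month[days$month]

# Monday that starts each day's ISO week (%u: Monday = 1, ..., Sunday = 7).
week_start <- days$date - (as.integer(format(days$date, "%u")) - 1L)

# Sum daily values by ISO week; ISO-formatted dates sort chronologically.
sums <- rowsum(days$cases, group = format(week_start))
weekly <- data.frame(
  date       = as.Date(rownames(sums)),
  cases_week = as.numeric(sums)
)
weekly$iso_week <- as.integer(format(weekly$date, "%V"))  # ISO week number

# The weekly series carries the same overall total as the monthly input.
stopifnot(isTRUE(all.equal(sum(weekly$cases_week), sum(monthly_cases))))
```

Note that the first row starts on 2010-12-27, because 1 January 2011 falls in the last ISO week of 2010; a real pipeline has to decide how to label such boundary weeks.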
Daily Simulated Data
Daily values were produced by allocating weekly totals with low-variance jitter, ensuring smooth short-term trends without altering the overall burden.
daily_cols <- tribble(
~Variable, ~Description, ~Type,
"date", "Daily date stamp", "Date",
"year", "Calendar year", "integer",
"yday", "Day of year (1–365/366)", "integer",
"wday", "Day of week (1–7)", "integer",
"hai", "HAI category", "character",
"region", "EU/EEA or Germany", "character",
"sex", "Male or Female", "character",
"cases_day", "Simulated daily cases", "numeric",
"deaths_day", "Simulated daily deaths", "numeric",
"dalys_day", "Simulated daily DALYs", "numeric"
)
kable(daily_cols, caption = "Table 4: Variables in the Daily Simulated Dataset")

| Variable | Description | Type |
|---|---|---|
| date | Daily date stamp | Date |
| year | Calendar year | integer |
| yday | Day of year (1–365/366) | integer |
| wday | Day of week (1–7) | integer |
| hai | HAI category | character |
| region | EU/EEA or Germany | character |
| sex | Male or Female | character |
| cases_day | Simulated daily cases | numeric |
| deaths_day | Simulated daily deaths | numeric |
| dalys_day | Simulated daily DALYs | numeric |
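The daily step, low-variance jitter that leaves the weekly totals unchanged, might look like the sketch below. It is not taken from the package: the helper `jitter_week()`, the noise level `sd = 0.05`, the seed value, and the weekly totals are assumptions for illustration. The key idea is to draw multiplicative noise around 1 and then rescale so the seven days sum back to the weekly total.

```r
# Split one weekly total across 7 days with small multiplicative noise,
# then rescale so the days sum back to the weekly total.
jitter_week <- function(week_total, sd = 0.05) {
  noise <- pmax(rnorm(7, mean = 1, sd = sd), 0)  # truncate at zero
  week_total * noise / sum(noise)
}

set.seed(2011)  # fixed seed (arbitrary value) makes the draw repeatable

weekly_cases <- c(21, 24, 19)  # placeholder weekly totals
daily <- data.frame(
  date      = seq(as.Date("2011-01-03"), by = "day",
                  length.out = 7 * length(weekly_cases)),
  cases_day = unlist(lapply(weekly_cases, jitter_week))
)
daily$yday <- as.integer(format(daily$date, "%j"))  # day of year
daily$wday <- as.integer(format(daily$date, "%u"))  # day of week, Monday = 1

# Each block of 7 days reproduces its weekly total.
week_id <- rep(seq_along(weekly_cases), each = 7)
stopifnot(isTRUE(all.equal(
  as.numeric(tapply(daily$cases_day, week_id, sum)), weekly_cases
)))
```

Rescaling after adding noise is what keeps the overall burden unchanged; the fixed seed only guarantees that repeated runs produce the same daily series.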
Summary
Overall, the simulation framework generates a coherent, multi-frequency dataset aligned with epidemiological practices, enabling flexible analysis across monthly, weekly, and daily time frames. By starting from validated journal totals and applying structured synthetic rules, the resulting datasets are stable, internally consistent, and suitable for time-series visualisation, comparative analysis, and workload modelling. The dataset is released under the Creative Commons Attribution 4.0 International License (CC BY 4.0), as described on the License Page.