Data Description • yusufHAIGermany

Purpose

This vignette details the structure, preparation, and simulation of datasets included in the yusufHAIGermany package. It describes the original source tables, the cleaning steps carried out, and the creation of harmonised monthly, weekly, and daily series used throughout the analysis vignettes. The aim is to provide transparency about data origins and to facilitate reproducibility.

Original Dataset

The original dataset was manually collected from the GHAI (Global Healthcare-Associated Infections) journal report for Germany (2011–2012). The report offers annual totals for five major infection categories: HAP, SSI, BSI, UTI, and CDI. Values include cases, deaths, and DALYs, with details by region (Germany versus EU/EEA) and sex where available.

These extracted values serve as the foundation for the simulation framework. Since the journal reports only aggregate annual totals, additional temporal detail had to be generated through synthetic, rule-based simulation.

library(tibble)  # provides tribble()
library(knitr)   # provides kable()

orig_cols <- tribble(
~Variable, ~Description, ~Type,
"year", "Calendar year of report (2011–2012)", "integer",
"region", "EU/EEA or Germany", "character",
"hai", "HAI category (HAP, SSI, BSI, UTI, CDI)", "character",
"sex", "Male or Female (if reported)", "character",
"cases", "Annual total cases", "numeric",
"deaths", "Annual total deaths", "numeric",
"dalys", "Annual DALYs", "numeric"
)

kable(orig_cols, caption = "Table 1: Variables in the Original Extracted Dataset")

Table 1: Variables in the Original Extracted Dataset
Variable	Description	Type
year	Calendar year of report (2011–2012)	integer
region	EU/EEA or Germany	character
hai	HAI category (HAP, SSI, BSI, UTI, CDI)	character
sex	Male or Female (if reported)	character
cases	Annual total cases	numeric
deaths	Annual total deaths	numeric
dalys	Annual DALYs	numeric

Cleaning and Simulation Process

Standardisation of category names: Infection types were harmonised across all source files.
Consistency checks: Totals were validated to ensure that no missing values or structural inconsistencies existed.
Creation of temporal structure: Because only annual values were available, synthetic series were generated at three frequencies. Monthly values are allocated in proportion to infection-specific seasonal patterns, weekly values are distributed according to the ISO week structure, and daily values are generated by adding low-variance random noise around the weekly means.
Reproducible simulation rules: UTI was assigned strong seasonality (higher in summer), HAP and SSI were assigned mild winter peaks, CDI was allocated mild autumn variation, and random noise was applied with fixed seeds.

All simulation scripts are included in the package to maintain reproducibility.
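The standardisation and consistency checks described above could look roughly like the following base R sketch. It is not the package's own script: the function name `validate_orig()` and the example rows are hypothetical, and the numbers are placeholders rather than values from the report.

```r
# Hypothetical example rows; the numbers are placeholders, not report values.
orig <- data.frame(
  year   = c(2011L, 2011L),
  region = c("Germany", "EU/EEA"),
  hai    = c("uti ", "HAP"),          # deliberately inconsistent labels
  sex    = c("Female", "Male"),
  cases  = c(100, 250),
  deaths = c(5, 12),
  dalys  = c(40, 95),
  stringsAsFactors = FALSE
)

# Harmonise category labels: trim whitespace and convert to upper case.
orig$hai <- toupper(trimws(orig$hai))

# Assumed checks: required columns present, no missing burden values,
# only known categories and regions, no negative counts.
validate_orig <- function(df) {
  required <- c("year", "region", "hai", "sex", "cases", "deaths", "dalys")
  burden   <- c("cases", "deaths", "dalys")
  stopifnot(
    all(required %in% names(df)),
    !any(is.na(df[burden])),
    all(df$hai %in% c("HAP", "SSI", "BSI", "UTI", "CDI")),
    all(df$region %in% c("Germany", "EU/EEA")),
    all(df$cases >= 0, df$deaths >= 0, df$dalys >= 0)
  )
  invisible(TRUE)
}

validate_orig(orig)
```

The `sex` column is not checked for missing values here because the source table reports sex only where available.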

Simulation Output Table

The simulation generated three consistent, tidy time-series datasets: monthly, weekly, and daily. Each includes harmonised fields for year, date, infection type, counts, and burden metrics.

Monthly Simulated Data

Synthetic monthly distributions were created for each HAI group so that they follow the assigned seasonal patterns while summing exactly to the journal’s annual totals.

monthly_cols <- tribble(
~Variable,        ~Description,                                    ~Type,
"date",           "First day of each month",                       "Date",
"year",           "Calendar year",                                 "integer",
"month",          "Month number (1–12)",                           "integer",
"hai",            "HAI category",                                  "character",
"region",         "EU/EEA or Germany",                             "character",
"sex",            "Male or Female (if applicable)",                "character",
"cases_month",    "Simulated monthly cases",                       "numeric",
"deaths_month",   "Simulated monthly deaths",                      "numeric",
"dalys_month",    "Simulated monthly DALYs",                       "numeric"
)

kable(monthly_cols, caption = "Table 2: Variables in the Monthly Simulated Dataset")

Table 2: Variables in the Monthly Simulated Dataset
Variable	Description	Type
date	First day of each month	Date
year	Calendar year	integer
month	Month number (1–12)	integer
hai	HAI category	character
region	EU/EEA or Germany	character
sex	Male or Female (if applicable)	character
cases_month	Simulated monthly cases	numeric
deaths_month	Simulated monthly deaths	numeric
dalys_month	Simulated monthly DALYs	numeric
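
As an illustration of how an annual total can be split into months while keeping the sum exact, the sketch below uses a cosine-shaped seasonal profile and largest-remainder rounding. Both choices are assumptions made for this example, as are the function names, the amplitude, and the annual total; the package's own scripts may use different weights.

```r
# Assumed seasonal profile: a cosine curve peaking in `peak_month`.
seasonal_weights <- function(peak_month, amplitude) {
  m <- 1:12
  w <- 1 + amplitude * cos(2 * pi * (m - peak_month) / 12)
  w / sum(w)                      # normalise so the weights sum to 1
}

# Split an integer total by weights and keep the sum exact.
allocate_total <- function(total, weights) {
  raw   <- total * weights
  base  <- floor(raw)
  short <- as.integer(round(total - sum(base)))  # units still to hand out
  # Largest-remainder rounding: leftover units go to the largest fractions.
  top <- order(raw - base, decreasing = TRUE)[seq_len(short)]
  base[top] <- base[top] + 1
  base
}

annual_cases <- 1200              # placeholder, not a report value
w_uti        <- seasonal_weights(peak_month = 7, amplitude = 0.3)
cases_month  <- allocate_total(annual_cases, w_uti)

monthly <- data.frame(
  date        = seq(as.Date("2011-01-01"), by = "month", length.out = 12),
  year        = 2011L,
  month       = 1:12,
  hai         = "UTI",
  cases_month = cases_month
)

# The monthly values add back up to the annual total.
stopifnot(sum(monthly$cases_month) == annual_cases)
```

Rounding with largest remainders matters only when integer counts are wanted; for fractional quantities such as DALYs, `total * weights` already sums to the total.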

Weekly Simulated Data

Weekly counts follow the ISO 8601 week convention for consistency with epidemiological reporting. The weekly figures sum to the monthly and yearly totals.

weekly_cols <- tribble(
~Variable,        ~Description,                                      ~Type,
"date",           "Start date of ISO week",                           "Date",
"year",           "Calendar year",                                    "integer",
"iso_week",       "ISO week number (1–52/53)",                        "integer",
"hai",            "HAI category",                                     "character",
"region",         "EU/EEA or Germany",                                "character",
"sex",            "Male or Female",                                   "character",
"cases_week",     "Simulated weekly cases",                           "numeric",
"deaths_week",    "Simulated weekly deaths",                          "numeric",
"dalys_week",     "Simulated weekly DALYs",                           "numeric"
)

kable(weekly_cols, caption = "Table 3: Variables in the Weekly Simulated Dataset")

Table 3: Variables in the Weekly Simulated Dataset
Variable	Description	Type
date	Start date of ISO week	Date
year	Calendar year	integer
iso_week	ISO week number (1–52/53)	integer
hai	HAI category	character
region	EU/EEA or Germany	character
sex	Male or Female	character
cases_week	Simulated weekly cases	numeric
deaths_week	Simulated weekly deaths	numeric
dalys_week	Simulated weekly DALYs	numeric
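
One way to move from monthly totals to ISO weeks is to spread each month evenly over its days and then sum the days by week. The sketch below follows that approach, which is an assumption made for illustration and not necessarily the rule used in the package. The monthly values are placeholders, and `iso_year` is a hypothetical helper column.

```r
# Placeholder monthly totals for one HAI category in 2011.
monthly_cases <- c(70, 74, 85, 100, 115, 126, 130, 126, 115, 100, 85, 74)

days <- data.frame(
  date = seq(as.Date("2011-01-01"), as.Date("2011-12-31"), by = "day")
)
days$month <- as.integer(format(days$date, "%m"))

# Spread each monthly total evenly over the days of that month.
days_in_month  <- as.vector(table(days$month))
days$cases_day <- monthly_cases[days$month] / days_in_month[days$month]

# ISO weeks start on Monday; "%u" gives the weekday as 1 (Mon) to 7 (Sun).
days$week_start <- days$date - (as.integer(format(days$date, "%u")) - 1L)

# Sum the daily values within each ISO week.
wk <- tapply(days$cases_day, format(days$week_start), sum)
weekly <- data.frame(
  date       = as.Date(names(wk)),
  cases_week = as.vector(wk)
)
weekly$iso_year <- as.integer(format(weekly$date, "%G"))
weekly$iso_week <- as.integer(format(weekly$date, "%V"))

# The weekly values add back up to the annual total.
stopifnot(isTRUE(all.equal(sum(weekly$cases_week), sum(monthly_cases))))
```

Because 1 January 2011 falls on a Saturday, the first row starts on 2010-12-27 and covers only two days of 2011. Weeks that straddle a month boundary take a share from both months, so under this approach the monthly totals are recovered from the daily values rather than from whole weeks.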

Daily Simulated Data

Daily values were produced by allocating weekly totals with low-variance jitter, ensuring smooth short-term trends without altering the overall burden.

daily_cols <- tribble(
~Variable,        ~Description,                                      ~Type,
"date",           "Daily date stamp",                                 "Date",
"year",           "Calendar year",                                    "integer",
"yday",           "Day of year (1–365/366)",                          "integer",
"wday",           "Day of week (1–7)",                                "integer",
"hai",            "HAI category",                                     "character",
"region",         "EU/EEA or Germany",                                "character",
"sex",            "Male or Female",                                   "character",
"cases_day",      "Simulated daily cases",                            "numeric",
"deaths_day",     "Simulated daily deaths",                           "numeric",
"dalys_day",      "Simulated daily DALYs",                            "numeric"
)

kable(daily_cols, caption = "Table 4: Variables in the Daily Simulated Dataset")

Table 4: Variables in the Daily Simulated Dataset
Variable	Description	Type
date	Daily date stamp	Date
year	Calendar year	integer
yday	Day of year (1–365/366)	integer
wday	Day of week (1–7)	integer
hai	HAI category	character
region	EU/EEA or Germany	character
sex	Male or Female	character
cases_day	Simulated daily cases	numeric
deaths_day	Simulated daily deaths	numeric
dalys_day	Simulated daily DALYs	numeric
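
The daily step can be sketched as multiplicative noise around an even split of each weekly total, rescaled so that the days sum back to the week. The function name `split_week()`, the noise level `sd_frac`, the seed, and the weekly total are all assumptions chosen for this example.

```r
# Split one weekly total into daily values with low-variance jitter.
split_week <- function(week_total, n_days = 7, sd_frac = 0.05) {
  # Noise centred on 1; pmax() guards against (very unlikely) negative draws.
  w <- pmax(rnorm(n_days, mean = 1, sd = sd_frac), 0)
  week_total * w / sum(w)         # rescale so the days sum to the weekly total
}

set.seed(2011)                    # fixed seed (arbitrary value) for reproducibility
week_total <- 28                  # placeholder, not a report value
cases_day  <- split_week(week_total)

# The jitter changes the daily shape but not the weekly burden.
stopifnot(isTRUE(all.equal(sum(cases_day), week_total)))
```

Setting the seed immediately before the call makes the daily series repeatable from run to run, and the rescaling keeps the weekly total unchanged whatever noise is drawn.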

Summary

Overall, the simulation framework generates a coherent, multi-frequency dataset aligned with epidemiological practices, enabling flexible analysis across monthly, weekly, and daily time frames. By starting from validated journal totals and applying structured synthetic rules, the resulting datasets are stable, internally consistent, and suitable for time-series visualisation, comparative analysis, and workload modelling. The dataset is licensed under the Creative Commons Attribution 4.0 International License (CC BY 4.0), as described on the License Page.