- AGGREGATE.py requires JSON / YAML files as input
- filter / expression may be almost arbitrarily complicated
- The input for AGGREGATE.py should follow this structure:
  - variable name: the name of the final column
    - type: dynamic or static, relevant for aggregation
    - table: the table the data should be extracted from
    - variable: if dynamic, this is the column to select from table (along with Time Relative to Admission (seconds)); if static, this is the column to aggregate on
    - variable_sources: if variable is a lab value, select the wanted sources (put null if None shall be included)
    - value_dtype: e.g. set to bool to get a binary variable
    - cutoff: cutoffs for static variables are applied separately (i.e. locally for that variable) from cutoffs for dynamic variables (globally for the full dataframe)
      - value: low (lo) / high (hi) cutoffs for the value; the value is clipped to these cutoffs (data is not removed); if one of the cutoffs isn't set, it is assumed to be None
      - time: low (lo) / high (hi) cutoffs for the time; the time series is filtered to between these timepoints (data is removed); if one of the cutoffs isn't set, it is assumed to be None
    - time_col: column to be used as time reference (Time Relative to Admission (seconds) if not specified)
    - aggregation: method to use for aggregation, one of sum, mean, median, max, min, first, last, count
    - sort: value to sort by for aggregation (Time Relative to Admission (seconds) if not specified)
    - group_by: value(s) to group by for aggregation (i.e., columns to be used in addition to Global ICU Stay ID)
    - requires: list of variables required (e.g. if there is a filter or expression)
    - filter: expression for filtering the data (has to include the columns to filter as strings within strings, e.g. "pl.col('power') > 9000")
    - prefilter: expression for filtering the table the data should be extracted from (has to include the columns to filter as strings within strings, e.g. "pl.col('power') > 9000")
    - expression: expression to calculate using the data (has to include the columns used for calculation as strings within strings, e.g. "pl.col('a_squared') + pl.col('a_squared')")
    - keep: bool, whether to keep the variable in the output frame

For example, the following configuration …

{
    "blood_sodium": {
        "type": "dynamic",
        "table": "timeseries_labs",
        "variable": "Sodium [Moles/volume] in Blood",
        "value_dtype": "float",
        "cutoff": {
            "value": {
                "lo": 80,
                "hi": 190
            },
            "time": {
                "lo": 0
            }
        },
        "keep": false
    },
    "first_hypernatremia_recordtime": {
        "type": "static",
        "requires": [
            "blood_sodium"
        ],
        "variable": "Time Relative to Admission (seconds)",
        "aggregation": "min",
        "filter": "pl.col('blood_sodium') > 145"
    },
    "blood_sodium_record_count": {
        "type": "static",
        "requires": [
            "blood_sodium"
        ],
        "variable": "blood_sodium",
        "aggregation": "count"
    }
}
… and results in the following example DataFrame:
│ Global ICU Stay ID ┆ blood_sodium_record_count ┆ first_hypernatremia_recordtime │
│ ---                ┆ ---                       ┆ ---                            │
│ str                ┆ u32                       ┆ f64                            │
╞════════════════════╪═══════════════════════════╪════════════════════════════════╡
│ eicu-1000020       ┆ 3                         ┆ 35100.0                        │
│ eicu-1000050       ┆ null                      ┆ null                           │
│ eicu-1000071       ┆ null                      ┆ null                           │
│ eicu-1000105       ┆ 3                         ┆ 113040.0                       │
│ eicu-1000106       ┆ null                      ┆ null                           │
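
To make the cutoff and aggregation semantics in the example more concrete, here is a minimal sketch in plain Python. It is not the AGGREGATE.py implementation and deliberately avoids Polars. The helper names (apply_cutoffs, aggregate), the stay IDs, and the records are hypothetical and made up for illustration, and treating the time bounds as inclusive is an assumption. It shows that value cutoffs clip without removing rows, that time cutoffs remove rows, and how a filtered min and a count are then computed per stay.

```python
import json

# Toy config following the structure described above (subset of keys only).
CONFIG = json.loads("""
{
    "blood_sodium": {
        "type": "dynamic",
        "cutoff": {"value": {"lo": 80, "hi": 190}, "time": {"lo": 0}},
        "keep": false
    }
}
""")

# Made-up records: (stay_id, time_seconds, sodium). Not real data.
RECORDS = [
    ("stay-A", -600, 150.0),  # before admission: removed by the time cutoff
    ("stay-A", 1200, 140.0),
    ("stay-A", 3600, 148.0),  # first value above 145
    ("stay-A", 7200, 250.0),  # implausibly high: clipped to 190, not removed
    ("stay-B", 900, 138.0),
]


def apply_cutoffs(records, cutoff):
    """Clip values to [lo, hi] and drop rows outside the time window."""
    value = cutoff.get("value", {})
    time = cutoff.get("time", {})
    v_lo, v_hi = value.get("lo"), value.get("hi")  # unset bound -> None
    t_lo, t_hi = time.get("lo"), time.get("hi")
    out = []
    for stay, t, v in records:
        # Time cutoff: rows outside the window are removed (assumed inclusive).
        if t_lo is not None and t < t_lo:
            continue
        if t_hi is not None and t > t_hi:
            continue
        # Value cutoff: the value is clipped, the row is kept.
        if v_lo is not None:
            v = max(v, v_lo)
        if v_hi is not None:
            v = min(v, v_hi)
        out.append((stay, t, v))
    return out


def aggregate(records, select, method, keep=lambda row: True):
    """Group by stay ID and aggregate select(row) over rows passing keep."""
    groups = {}
    for row in records:
        if keep(row):
            groups.setdefault(row[0], []).append(select(row))
    methods = {"min": min, "max": max, "sum": sum, "count": len}
    return {stay: methods[method](vals) for stay, vals in groups.items()}


sodium = apply_cutoffs(RECORDS, CONFIG["blood_sodium"]["cutoff"])

# Analogue of "first_hypernatremia_recordtime": min time where sodium > 145.
first_hypernatremia = aggregate(
    sodium, lambda r: r[1], "min", keep=lambda r: r[2] > 145
)

# Analogue of "blood_sodium_record_count": number of records per stay.
record_count = aggregate(sodium, lambda r: r[2], "count")

print(first_hypernatremia)
print(record_count)
```

In this sketch a stay with no row passing the filter is simply absent from the result dictionary, which plays the role of the null entries in the table above.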