IN GENERAL
- One does not have to iterate over folders of patient files anymore!
- All patients are harmonized within the large `.parquet` files in `reprodICU_files`; a smaller DEMO dataset (built from the MIMIC-III/-IV and eICU demos) is available in `reproDEMO`
- Any kind of data manipulation / selection can now happen lazily without the need for processing the full database every time
- Use `polars` instead of `pandas`
- Because processing steps are declared much more clearly (mostly via `.pipe()`), the resulting code is much more readable
- One can also easily precalculate pipe-able functions for the full dataset and save the results, which can then be loaded lazily as well (increasing efficiency even further)
- Any imputation or preprocessing that happens after the first basic harmonization step is now explicitly declared and can be reproduced independently of a full database rebuild
- Some variables and their locations have moved:
  - There is no `flats` table anymore
  - Also, `height`, `weight`, `age` etc. are now given as raw values (or as imputed / winsorized values when processing steps are applied)
  - The table `labels` was renamed `patient_information` and now contains additional information (where available), such as
    - ethnicity
    - pre-ICU stay duration (in days)
    - in-ICU mortality
    - in-hospital mortality
    - post-ICU mortality (truthy, in days)
    - admission type
    - admission urgency
    - admission category
    - specialty
  - The table `timeseries` was split by timeseries category and approximate logging frequency to reduce storage space:
    - `vitals` contains values such as heart rate, blood pressure and O2 saturation by pulse oximetry
    - `respiratory` contains values concerning ventilation
    - `labs` contains lab values (incl. blood gases)
    - `inout` contains intake / output volumes