Data Collection Requirements for AI-Based Diabetes Management App
To train an effective AI model for diabetes management, you must collect comprehensive insulin pump data (basal and bolus delivery records with timestamps) alongside continuous glucose monitoring data, patient demographics, and contextual lifestyle information. These interconnected data streams are essential for developing algorithms that can achieve >70% time-in-range and prevent dangerous glycemic excursions. 1
Core Insulin Pump Data Elements
Insulin Delivery Records
- Basal insulin delivery rates with precise timestamps, including any temporary basal rate adjustments made by the user or automated systems 1
- Bolus insulin doses with clear differentiation between meal boluses and correction boluses, though this distinction may require inference if not explicitly recorded 1
- Insulin-on-board (IOB) calculations to prevent insulin stacking and hypoglycemia when training dosing recommendation algorithms 2
- Total daily insulin dose (TDD) tracked over time, as this is fundamental for calculating correction factors (1500/TDD or 1700/TDD formulas) 2
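As a rough illustration of the TDD-based correction factor rules listed above, the sketch below averages recent daily totals and applies the rule constant. The function names, the 7-day window, and the choice of 1700 as the default constant are assumptions made for this example, not settings taken from the cited guidance.

```python
def rolling_tdd(daily_totals, window_days=7):
    """Mean total daily dose (units) over the most recent window_days entries."""
    if window_days < 1:
        raise ValueError("window_days must be at least 1")
    recent = daily_totals[-window_days:]
    if not recent:
        raise ValueError("at least one daily total is required")
    return sum(recent) / len(recent)


def correction_factor_mg_dl(tdd_units, rule_constant=1700.0):
    """Estimated glucose drop (mg/dL) per unit of insulin: rule_constant / TDD."""
    if tdd_units <= 0:
        raise ValueError("TDD must be positive")
    return rule_constant / tdd_units


# Hypothetical week of daily totals, in units of insulin.
tdd = rolling_tdd([48.0, 52.0, 50.0, 49.0, 51.0, 50.0, 50.0])
print(correction_factor_mg_dl(tdd, rule_constant=1500.0))
print(correction_factor_mg_dl(tdd, rule_constant=1700.0))
```

Storing the raw daily totals rather than only the derived factor keeps both rules available when the training pipeline is revised.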
Critical caveat: Insulin pump data often fails to indicate whether insulin was administered for a meal or for a correction, so the split must be inferred algorithmically: inferred correction dose = (CGM_current - CGM_target) / correction factor, and meal insulin = total bolus - inferred correction dose 1
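The inference formula above might be applied per bolus along the following lines. This is a minimal sketch: the function name is hypothetical, and clamping the correction portion to the range between zero and the total bolus is an assumption added here so that a reading below target does not produce a negative correction dose. It also ignores insulin already on board.

```python
def split_bolus(total_bolus_units, cgm_current, cgm_target, correction_factor):
    """Return (correction_units, meal_units) for one recorded bolus.

    correction_factor is the expected glucose drop in mg/dL per unit of insulin.
    """
    if correction_factor <= 0:
        raise ValueError("correction_factor must be positive")
    raw_correction = (cgm_current - cgm_target) / correction_factor
    # Assumption: the correction part cannot be negative or exceed the bolus.
    correction = min(max(raw_correction, 0.0), total_bolus_units)
    meal = total_bolus_units - correction
    return correction, meal


# A 6-unit bolus at 220 mg/dL with a 120 mg/dL target and a factor of 50 mg/dL/unit.
print(split_bolus(6.0, 220.0, 120.0, 50.0))  # -> (2.0, 4.0)
```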
Pump Settings and Parameters
- Correction factor (insulin sensitivity factor) settings and any adjustments over time 2
- Carbohydrate-to-insulin ratios (ICR) for different times of day 2
- Target glucose ranges set by the patient or clinician 1
- Insulin absorption duration parameters (typically 3-4 hours for rapid-acting insulin) 2
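To show how the absorption duration setting feeds an insulin-on-board estimate, here is a deliberately simplified sketch. It assumes linear decay of each bolus over the configured duration, which is an illustrative simplification rather than the curve any particular pump uses. The function name and the 240-minute default are likewise assumptions.

```python
def insulin_on_board(boluses, now_min, action_duration_min=240.0):
    """Sum the not-yet-absorbed part of each bolus under linear decay.

    boluses: iterable of (delivery_time_min, units) pairs.
    now_min: current time in minutes on the same clock as the delivery times.
    """
    if action_duration_min <= 0:
        raise ValueError("action_duration_min must be positive")
    total = 0.0
    for delivered_at, units in boluses:
        elapsed = now_min - delivered_at
        # Skip boluses that lie in the future or are fully absorbed.
        if 0 <= elapsed < action_duration_min:
            total += units * (1.0 - elapsed / action_duration_min)
    return total


# 4 units at t=0 and 2 units at t=120, evaluated at t=180 with a 4-hour duration.
print(insulin_on_board([(0, 4.0), (120, 2.0)], now_min=180))  # -> 2.5
```

Recording the duration setting alongside each bolus lets the IOB be recomputed later under a different decay model.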
Continuous Glucose Monitoring (CGM) Data
Glucose Measurements
- High-frequency CGM readings sampled at least every 5 minutes, as this resolution captures detailed glucose dynamics that 15-60 minute intervals obscure 3
- Timestamps synchronized with insulin delivery data to enable accurate pattern recognition 1
- CGM manufacturer and model information, as different sensors use different algorithms that may introduce bias requiring algorithmic compensation 1
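One way to synchronize insulin events with CGM timestamps is to match each event to the nearest reading within a tolerance. The sketch below assumes the CGM timestamps are sorted and already on the same clock as the pump. The function name and the 5-minute tolerance are hypothetical choices for illustration.

```python
from bisect import bisect_left
from datetime import datetime, timedelta


def nearest_cgm_index(cgm_times, event_time, tolerance=timedelta(minutes=5)):
    """Index of the CGM reading closest to event_time, or None if none is within tolerance.

    cgm_times must be sorted in ascending order.
    """
    if not cgm_times:
        return None
    pos = bisect_left(cgm_times, event_time)
    # Only the readings just before and just after the event can be nearest.
    candidates = [i for i in (pos - 1, pos) if 0 <= i < len(cgm_times)]
    best = min(candidates, key=lambda i: abs(cgm_times[i] - event_time))
    if abs(cgm_times[best] - event_time) <= tolerance:
        return best
    return None


start = datetime(2024, 1, 1, 8, 0)
cgm_times = [start + timedelta(minutes=5 * k) for k in range(4)]  # 08:00 to 08:15
print(nearest_cgm_index(cgm_times, datetime(2024, 1, 1, 8, 7)))  # -> 1
```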
Important consideration: CGM data frequently has gaps due to wireless connectivity failures, sensor misplacement, or pressure-induced attenuation. Your data pipeline must handle missing data by linearly interpolating gaps <20 minutes and excluding longer gaps 1
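The gap rule could be sketched as follows. The example assumes readings are (minutes, mg/dL) pairs sorted by time on a nominal 5-minute grid, and it treats a "gap" as the time between two consecutive measured readings. The function name and the per-point interpolated flag are assumptions introduced here so that filled values stay distinguishable from measured ones.

```python
def fill_short_gaps(readings, step_min=5, max_gap_min=20):
    """Interpolate short gaps and split the series at long ones.

    readings: sorted list of (minutes, mg_dl) measured values.
    Returns a list of segments; each segment is a list of
    (minutes, mg_dl, interpolated) tuples.
    """
    segments = []
    current = []
    for t, v in readings:
        if current:
            t0, v0, _ = current[-1]
            gap = t - t0
            if gap >= max_gap_min:
                # Gap too long: close the segment instead of interpolating.
                segments.append(current)
                current = []
            else:
                # Fill missing grid points linearly between the two measurements.
                fill_t = t0 + step_min
                while fill_t < t:
                    frac = (fill_t - t0) / gap
                    current.append((fill_t, v0 + frac * (v - v0), True))
                    fill_t += step_min
        current.append((t, v, False))
    if current:
        segments.append(current)
    return segments


series = [(0, 100.0), (5, 110.0), (15, 130.0), (45, 150.0), (50, 155.0)]
for segment in fill_short_gaps(series):
    print(segment)
```

Here the 10-minute gap is filled with one flagged point at minute 10, while the 30-minute gap splits the series into two segments.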
Glucose Variability Metrics
- Time-in-range (TIR) data (70-180 mg/dL target) as this is the primary efficacy outcome for ML algorithms 1, 4
- Time below range, especially <54 mg/dL (<3.0 mmol/L), for hypoglycemia detection 5
- Glucose rate of change data, noting that physiologically plausible changes are limited to ±4 mg/dL/minute 1
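The metrics in this list can be computed directly from a CGM series. The sketch below is illustrative only: the function names are hypothetical, and it assumes evenly weighted readings in mg/dL so that the share of readings stands in for the share of time.

```python
def glycemic_metrics(values_mg_dl, low=70.0, high=180.0, severe_low=54.0):
    """Percent of readings in range (low..high inclusive) and below severe_low."""
    n = len(values_mg_dl)
    if n == 0:
        raise ValueError("at least one reading is required")
    in_range = sum(1 for v in values_mg_dl if low <= v <= high)
    below = sum(1 for v in values_mg_dl if v < severe_low)
    return {"tir_pct": 100.0 * in_range / n, "tbr54_pct": 100.0 * below / n}


def implausible_jumps(readings, max_rate=4.0):
    """Indices of readings whose rate of change from the previous one exceeds max_rate.

    readings: sorted list of (minutes, mg_dl); max_rate is in mg/dL per minute.
    """
    flagged = []
    for i in range(1, len(readings)):
        (t0, v0), (t1, v1) = readings[i - 1], readings[i]
        dt = t1 - t0
        if dt > 0 and abs(v1 - v0) / dt > max_rate:
            flagged.append(i)
    return flagged


values = [50, 65, 80, 120, 150, 180, 200, 250, 100, 110]
print(glycemic_metrics(values))  # -> {'tir_pct': 60.0, 'tbr54_pct': 10.0}
print(implausible_jumps([(0, 100), (5, 115), (10, 140), (15, 142)]))  # -> [2]
```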
Patient Demographics and Clinical History
Essential Demographic Data
- Age category (children, adolescents, adults) as glucose dynamics vary significantly across age groups and training sets must be balanced across populations 1
- Duration of diabetes, which influences insulin sensitivity and correction factor requirements 2
- Body weight for calculating insulin doses per kg/day and identifying when basal insulin exceeds 0.5 units/kg/day 2
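The weight-normalized dose check mentioned in the last item is simple arithmetic, sketched below under the assumption that weight is recorded in kilograms and basal insulin as a daily total in units. The function names are hypothetical.

```python
def units_per_kg_per_day(daily_units, weight_kg):
    """Insulin dose normalized by body weight (units/kg/day)."""
    if weight_kg <= 0:
        raise ValueError("weight_kg must be positive")
    return daily_units / weight_kg


def basal_exceeds_threshold(basal_units_per_day, weight_kg, threshold=0.5):
    """True when daily basal insulin is above threshold units/kg/day."""
    return units_per_kg_per_day(basal_units_per_day, weight_kg) > threshold


print(basal_exceeds_threshold(42.0, 70.0))  # 0.6 units/kg/day -> True
```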
Clinical Parameters
- HbA1c values tracked over time as a long-term glycemic control marker 1
- Changes in clinical status including illness, steroid use, or medication changes that temporarily alter insulin sensitivity 2
- History of severe hypoglycemia or diabetic ketoacidosis events 5
Lifestyle and Contextual Data
Meal and Nutrition Information
- Carbohydrate intake with timestamps, as meal announcements are critical for current automated insulin delivery systems, while ML algorithms can achieve automated meal detection 1
- Meal timing patterns to enable prediction of glucose excursions during daytime when food is consumed 1
Physical Activity Data
- Exercise type (aerobic, resistance, interval) as different exercise modalities have distinct effects on glucose dynamics 1
- Exercise timing and duration, since aerobic exercise can cause sharp glucose drops and dangerous hypoglycemia 1
- Heart rate and accelerometry data from fitness trackers when available, as demonstrated in the T1-Dexi dataset 1
Additional Contextual Factors
- Sleep quality and patterns, as automated insulin delivery systems show primary benefit during overnight periods 1
- Stress levels and life events that impact glucose variability 1
- Pain levels when reported, as these affect metabolic responses 1
Data Quality and Structure Considerations
Handling Data Challenges
- Document all data imputation methods clearly, as interpolation on test data can lead to invalid accuracy estimates through data leakage 1
- Never report algorithm performance on interpolated values in test sets—only use actual measured data for validation 1
- Account for sensor calibration data from blood glucose meters separately from CGM readings 1
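Restricting evaluation to measured values might look like the sketch below. It assumes each test point carries a boolean flag set during imputation. The function name and the choice of RMSE as the metric are illustrative assumptions.

```python
from math import sqrt


def rmse_measured_only(y_true, y_pred, interpolated):
    """RMSE over test points whose target value was actually measured."""
    pairs = [(t, p) for t, p, flag in zip(y_true, y_pred, interpolated) if not flag]
    if not pairs:
        raise ValueError("no measured values to evaluate on")
    return sqrt(sum((t - p) ** 2 for t, p in pairs) / len(pairs))


y_true = [100.0, 120.0, 140.0]
y_pred = [101.0, 150.0, 133.0]
interpolated = [False, True, False]  # the middle target was filled in, so it is skipped
print(rmse_measured_only(y_true, y_pred, interpolated))  # -> 5.0
```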
Dataset Balance Requirements
- Ensure training, validation, and test sets are balanced across demographic groups (age, sex, diabetes duration) to prevent poor performance on underrepresented populations 1
- Include sufficient inter- and intra-individual variability in glucose dynamics, as this variability is influenced by nutrition, lifestyle, medications, stress, and comorbidities 1
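A balanced split can be approximated by stratifying at the patient level, so that each demographic group contributes proportionally to every set and no patient appears in more than one set. The sketch below uses only the standard library. The function name, the string stratum labels, and the default fractions are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_patient_split(patients, fractions=(0.7, 0.15, 0.15), seed=0):
    """Split patient IDs into train/val/test within each stratum.

    patients: dict mapping patient_id -> stratum label (e.g. "adult").
    fractions: (train, val, test) shares; test receives the remainder.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for pid, stratum in patients.items():
        by_stratum[stratum].append(pid)

    split = {"train": [], "val": [], "test": []}
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])  # sort first so the shuffle is reproducible
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        split["train"] += ids[:n_train]
        split["val"] += ids[n_train:n_train + n_val]
        split["test"] += ids[n_train + n_val:]
    return split


cohort = {f"adult-{i}": "adult" for i in range(20)}
cohort.update({f"child-{i}": "child" for i in range(20)})
result = stratified_patient_split(cohort, fractions=(0.6, 0.2, 0.2))
print({name: len(ids) for name, ids in result.items()})
```

Splitting by patient rather than by individual reading also keeps one person's data from appearing in both the training and test sets.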
Device-Specific Considerations
- Record all device models used (CGM sensors, insulin pumps, smart pens) as algorithms may need manufacturer/model as input features to handle bias differences 1
- Track device failures including infusion set problems and sensor faults for anomaly detection training 1
Integration with Electronic Health Records
- Medication regimens beyond insulin, particularly those affecting glucose metabolism 1
- Comorbidities and underlying health conditions that influence glucose dynamics 1
- Laboratory values relevant to diabetes management 1
Key implementation note: The OpenAPS Data Commons dataset demonstrates the value of large-scale data collection. It contains over 46,070 days of data with >10 million CGM data points alongside insulin dosing and algorithmic decisions, a scale that enables robust ML model training 4