Select Page

Today, we’re going to foray into the world of Apple.

I am a diabetic and wear a continuous glucose monitor, or CGM for short. It records my blood sugar every few minutes and communicates the results to my phone. Conveniently, I can share my data with my doctor, and together, we discuss specific trends and patterns and make truly data-driven decisions.

Naturally, I am interested in the data collected by the app. Since Dexcom G6 (my CGM brand) doesn’t have a csv export function, I worked around that limitation by syncing the data with Apple Health. The problem is Apple Health does not allow the user to filter what data is exported. Consequently, the export file that is generated is a monolith. If you’ve had your phone for at least a year, the export file is so large that it’s almost impossible to open. Almost!

I accepted the challenge: find a way to read the file and play with it.

Python, to the rescue!


We will approach this challenge like a good ol’ data science project. First, we will use the PAPEM-DM framework to produce insightful graphics and a clean csv file.

Here are the two major obstacles:

  • The file is in XML format.
  • The file is so large that XML readers and converters online can’t even open it.
  • We do not know what the schema is.

And here’s our approach:

  1. Read the XML file into a dataframe
  2. Whittle down the data into just Dexcom G6 data
  3. Clean up the dataframe a little bit
  4. Export the dataframe into a csv
  5. Conduct feature engineering
  6. Conduct Exploratory Data Analysis (make pretty graphs)
  7. Predict whether blood sugar is high, normal, or low
  8. Do some time-series forecasting

Acquisition of Apple Health Data from XML to CSV

It’s time to fire up a Jupyter notebook and get our data!

First, let’s import some Python packages.

# setting the random seed for reproducibility
import random

# for reading the xml file
import xml.etree.ElementTree as ET

# for manipulating dataframes
import pandas as pd
import numpy as np

# for working with timestamps
import datetime as dt
from dateutil.parser import parse

# for modeling
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn import metrics

# for time-series forecasting
import matplotlib.pyplot as plt
from prophet import Prophet

# for visualizations
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

# to print out all the outputs
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

# set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
pd.set_option('display.max_colwidth', None)

Second, let’s create the element tree object [1].

# create element tree object
tree = ET.parse('export.xml')

# extract the attributes for every health record
root = tree.getroot()
record_list = [x.attrib for x in root.iter('Record')]

Third, let’s convert the element tree object into a dataframe. [1]

# create the dataframe
df = pd.DataFrame(record_list)

Below, we can see the odd names on the type column.

Screenshot by the Ednalyn C. De Dios

Let’s fix that!

# shorter observation names
df['type'] = df['type'].str.replace('HKQuantityTypeIdentifier', '')
df['type'] = df['type'].str.replace('HKCategoryTypeIdentifier', '')

Here we go:

Screenshot by the Ednalyn C. De Dios

Our dataframe is big!

Screenshot by the Ednalyn C. De Dios

Let’s search for what we’re after.

Screenshot by the Ednalyn C. De Dios

Let’s filter for it.


Then, let’s further whittle down the size of our dataframe by keeping only the necessary columns and rows.

df1 = df.loc[df['type'] == 'BloodGlucose']
df1 = df1[['startDate', 'value']]


Screenshot by the Ednalyn C. De Dios

Now, let’s do a little bit of cleaning. Below, we change the data type of the startDate column to datetime and force the value column to be of the numeric datatype. Lastly, we fill the blank fields with 1 for those records that do not necessarily measure anything.

df1["startDate"] = pd.to_datetime(df1["startDate"])

# value is numeric, NaN if fails
df1['value'] = pd.to_numeric(df1['value'], errors='coerce')

# some records do not measure anything, just count occurences
# filling with 1.0 (= one time) makes it easier to aggregate
df1['value'] = df1['value'].fillna(1.0)

Finally, it’s time to export our dataframe into a csv file. Let’s name it “blood_glucose.csv.”

df1.to_csv('blood_glucose.csv', index=False)

Preparation of Apple Health Data from Dexcom G6

Now, let’s get ready to do some wrangling.

First, let’s rename ‘startDate’ with ‘timestamp’ for clarity. Then, we’ll extract the time components and attributes from the timestamp column.

# rename date column
df2 = df1.rename(columns={'startDate':'timestamp'})

# extract time components from the timestamp column
df2['year'] = df2.timestamp.dt.year
df2['month'] = df2.timestamp.dt.month
df2['day'] =
df2['hour'] = df2.timestamp.dt.hour

# extract time attributes from the timestamp column
df2['day_of_year'] = df2.timestamp.dt.dayofyear
df2['week_of_year'] = df2.timestamp.dt.weekofyear
df2['day_of_week'] = df2.timestamp.dt.dayofweek

We will also extract the weekday characteristic from the day of the week column and derive the part of the day based on the hour column.

# extract weekday characteristic from the day of the week column
df2['weekday'] = np.where(df2['day_of_week'] < 5, True, False)

# extract part of the day from the hour column
def get_day_period(x):
if x in range(6,12):
return 'Morning'
elif x in range(12,18):
return 'Afternoon'
elif x in range(18,23):
return 'Evening'
return 'Late night'

df2['part_of_day'] = df2['hour'].apply(get_day_period)

Next, we will categorize whether a record’s blood glucose is high, normal, or low.

# extract blood glucose level from the value column
def get_status(row):
if row['value'] < 70:
val = "Low"
elif row['value'] <= 180:
val = "Normal"
val = "High"
return val

df2['status'] = df2.apply(get_status, axis=1)

Looking at the value counts, we can see a massive imbalance between the status categories.

Screenshot by the Ednalyn C. De Dios

I’ve decided to forgo the low status and create another column indicating whether a reading is high or not.

# creates a new column and designates a row as either high or low
df2['high'] = np.where(df2['status'] != 'High', '0', '1').astype('int32')

And now, we’re ready to drop timestamp and status columns because we no longer need them.

df2 = df2.drop(columns=['status', 'timestamp'])

Let’s look at our dataframe now.

Screenshot by the Ednalyn C. De Dios

Exploratory Data Analysis of Apple Health Data from Dexcom G6

We’re about to do some visualizations, so let’s set up our figure size and layout.

plt.rcParams["figure.figsize"] = [10, 6]
plt.rcParams["figure.autolayout"] = True

Before we get too deep in the weeds, let’s check out the years we’re working with. Let’s do a value_counts().

show_values(df2, ['year'])

We can see, though, that the years are out of order, so let’s just manually create a list of the years in the correct chronological order.

years = [2020, 2021, 2022, 2023]

Next comes the exciting part. Let’s make some violin plots! Below we’re defining a function that will create some violin plots for us based on what part of the day it is, the blood glucose value, and whether or not it is a weekend. Here’s the code to do all that:

def show_violin(df, year):
ax = sns.violinplot(x="part_of_day",
data=df[df['year'] == year],
order=['Morning', 'Afternoon', 'Evening', 'Late night'])
ax.set_title('Distribution of Blood Glucose Value per Part of the Day ' + str(year), fontsize=16)

# Set label for x-axis
plt.ylabel( "Blood Glucose" , size = 12 )
plt.xlabel( "Part of the Day" , size = 12 )

plt.legend(title='Weekday?', loc='upper right')

And the snippet that will make create individual violin plots based on the year:

for year in years:
show_violin(df2, year)

And the result:

Screenshot by the Ednalyn C. De Dios


But what if we want to compare all the years together? Let’s do the same thing again, except lay the graphs on top of each other and see what they look like. In the code, it translates into omitting the line in the function. Here’s the code:

def show_violin_no_legend(df, year):
ax = sns.violinplot(x="part_of_day",
data=df[df['year'] == year],
order=['Morning', 'Afternoon', 'Evening', 'Late night'])
ax.set_title('Distribution of Blood Glucose Value per Part of the Day Over the Years', fontsize=16)

# Set label for x-axis
plt.ylabel( "Blood Glucose" , size = 12 )
plt.xlabel( "Part of the Day" , size = 12 )

And the result:

Screenshot by the Ednalyn C. De Dios


Let’s pat ourselves on the back. We managed to pack a lot of dimensions into one pretty graph!

Exploratory Data Analysis is a process with no one correct way to do it. So we’ll leave the rest of the usual suspects (distribution, average over time graphs, et cetera) and move on to modeling.

Modeling — Logistic Regression

First, let’s drop the value column since we will use the high column as the target variable.

df3 = df2.drop(columns=['value'])

Since we will be doing a classification task to predict whether a reading will be high, let’s create dummy variables for our categorical variables.

categorical_vars = ['month', 'weekday', 'part_of_day']

for var in categorical_vars:
cat_list = pd.get_dummies(df3[var], prefix=var)

to_keep=[i for i in data_vars if i not in categorical_vars]


Let’s check our dataframe for class imbalance:

show_values(df4, ['high'])

As the above shows, there are a lot more non-highs than highs. This could lead to some inaccuracy with our model so we will fix that with SMOTE. A technique used in machine learning to address the problem of imbalanced datasets, SMOTE (Synthetic Minority Over-sampling Technique) works by creating synthetic data points for the minority class by interpolating between existing minority class data points. It does this by randomly selecting a minority class data point and finding its k-nearest neighbors. It then randomly chooses one of those neighbors and creates a new data point by interpolating between them.

The new data point is created by selecting a random point on the line segment that connects the minority data point and its chosen neighbor. This process is repeated until the minority class is balanced with the majority class.

The result is a balanced dataset, which allows machine learning algorithms to be trained more effectively. Furthermore, by creating synthetic data points, SMOTE ensures that the classifier is not biased towards the majority class, which can improve its performance on the minority class.

Here’s the implementation of SMOTE borrowed from Susan Li’s article on Logistic Regression. [2]

X = df4.loc[:, df4.columns != 'high']
y = df4.loc[:, df4.columns == 'high']

os = SMOTE(random_state=493)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=493)

columns = X_train.columns
os_data_X,os_data_y=os.fit_resample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns )
os_data_y= pd.DataFrame(data=os_data_y,columns=['high'])

# we can check the numbers of our data
print("length of oversampled data is ",len(os_data_X))
print("Number of no subscription in oversampled data",len(os_data_y[os_data_y['high']==0]))
print("Number of subscription",len(os_data_y[os_data_y['high']==1]))
print("Proportion of no subscription data in oversampled data is ",len(os_data_y[os_data_y['high']==0])/len(os_data_X))
print("Proportion of subscription data in oversampled data is ",len(os_data_y[os_data_y['high']==1])/len(os_data_X))

Next, we’ll conduct feature selection by practicing common sense and relying on intuition to select which columns to use to predict high blood glucose.

'part_of_day_Late night',

Then, we’ll divide our dataframe into independent and dependent variables, X and y.


Afterward, we’ll do a train-test-split and create a Logistic Regression object, which we will then fit with our data.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=493)

logreg = LogisticRegression(solver="liblinear"), y_train)

At long last, we’re ready to predict!

y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

We’ll even add the confusion matrix and other metrics to evaluate the model’s performance.

confusion_matrix = confusion_matrix(y_test, y_pred)

And the result:

Screenshot by the Ednalyn C. De Dios

And since we’re doing binary classification, we will examine the ROC (Receiver Operating Characteristic) and AUC (Area Under the Curve).

An ROC curve is a graph that shows the trade-off between the true positive rate (sensitivity) and the false positive rate (1 — specificity) of a binary classifier as the classification threshold is varied. In other words, it shows how well the classifier can distinguish between positive and negative examples at different levels of certainty.

The ROC curve is created by plotting the true positive rate against the false positive rate for each possible threshold value. The AUC is the area under the ROC curve, which ranges from 0 to 1. An AUC of 1 means that the classifier can perfectly distinguish between positive and negative examples, while an AUC of 0.5 means that it is no better than random guessing.

Screenshot by the Ednalyn C. De Dios

In plain English, ROC AUC provides a way to measure how well a binary classifier can correctly distinguish between positive and negative examples, regardless of the threshold used. A high AUC score indicates the good performance of the classifier, while a low score indicates poor performance. It is a valuable tool for evaluating and comparing the performance of different classifiers, especially when the dataset is imbalanced or the cost of false positives and false negatives is different.

Time-Series Forecasting — Prophet

For our last stint, we will use Prophet for time-series forecasting. Prophet is a time-series forecasting tool developed by Facebook that is designed to be user-friendly and flexible. When using Prophet for time-series forecasting, there are several things to keep in mind:

  • The data should be in a specific format: Prophet requires time-series data in a particular format, with a column named ‘ds’ for dates/times and a column named ‘y’ for the values to be predicted.
  • Outliers should be handled: Outliers can significantly impact the accuracy of forecasts, so it’s essential to identify and handle them appropriately.
  • Seasonality and trend components should be accounted for: Prophet is designed to capture seasonality and trend components in time-series data, so it’s essential to include them in the model if they exist in the data.
  • Holidays and events should be included: Prophet allows you to include holidays and events that may affect the time-series data, such as national holidays or marketing campaigns.
  • Multiple time-series can be included: Prophet can handle multiple time-series with different seasonalities and trends, allowing you to model and forecast them simultaneously.
  • Forecast uncertainty should be considered: Prophet provides uncertainty intervals for its forecasts, which should be considered when making decisions based on the forecasts.
  • Hyperparameters should be tuned: Prophet has several hyperparameters that can be adjusted to improve the accuracy of forecasts, such as the number of Fourier terms used to capture seasonality.

Overall, Prophet can be a powerful tool for time-series forecasting, but it’s essential to carefully consider the data and model settings to ensure accurate and useful predictions. As such, we will import the csv file we saved earlier to focus on our dataframe before all the transformations.

df5 = pd.read_csv('blood_glucose.csv', parse_dates=True, infer_datetime_format=True)

Next, we’ll rename the columns to ‘ds’ and ‘y’ as Prophet needs.

df5 = df5.rename(columns={'startDate':'ds', 'value':'y'})

We’ll also ensure that the ‘ds’ columns are in a datetime format. With tx_localize(None), we’re also removing the timezone part of the timestamp since Prophet doesn’t like it.

df5["ds"] = pd.to_datetime(df5["ds"])
df5['ds'] = df5['ds'].dt.tz_localize(None)

Then, we will resample our data to account for the average of the daily readings.

df6=df5.set_index('ds').resample('D').agg(y=('y', 'mean'))
df6 = df6.reset_index()

And we’re ready:

model = Prophet()
df6_forecast = model.make_future_dataframe(periods=12, freq='MS')
df6_forecast = model.predict(df6_forecast)
plt.figure(figsize=(18, 6))
model.plot(df5_forecast, xlabel = 'Timestamp', ylabel = 'Glucose')
plt.title('Blood Glucose')
Screenshot by the Ednalyn C. De Dios

And for the components:

Screenshot by the Ednalyn C. De Dios


We’ve come a long way from importing Apple Health data to transforming it to discover insights, predict high blood sugar, and forecast blood glucose.

Along the way, we learned how to use the datetime object, make violin plots, balance an imbalance dataset with SMOTE, classify with logistic regression, and forecast with Prophet.

Next, we’ll play with the cleaned data and use Power BI to create a dashboard.

Stay tuned!

Thanks for stopping by and reading my post. I hope this step-by-step walkthrough helps you!