STUDY CASE

Hospital ABC collected data on patients with a given cancer. The initial visit, which took place before treatment initiation, allowed clinical and radiomics data to be collected for each patient. All patients were then treated with the same treatment Z and followed until progression, death or censoring.


PRIMARY OBJECTIVE: Treatment response was evaluated 3 months after treatment initiation. The clinicians’ question is whether it is possible to predict, before treatment initiation, whether a given patient will or will not respond to treatment.


SECONDARY OBJECTIVE: Each patient was followed after treatment initiation until progression or death. Some patients were censored (e.g., lost to follow-up). Clinicians want to predict the Progression-Free Survival (PFS) probability over time for each patient, based on the information collected at treatment initiation.


### Import all libraries
library(readr) ## import data
library(dplyr) ## Create summary tables
library(DT) ## table aesthetic
library(ggplot2) ## for graphs
library(scales) ## To get proper plot ticks
library(grid)  ## For several plots on one image
library(reshape2) ## to go from matrix to relationship table
library(caret) ## For machine learning algorithms
# library(car) ## To get Variance Inflation Factors
# library(bestNormalize) ## data transformation (to gaussian distributions)
library(glmnet) ## for LASSO
# library(ggRandomForests) # For Random Survival Forest


### Import dataset
### Before importing, decimal commas were replaced with dots and semicolons with commas in the csv for easier parsing
cancer_dat <- read_csv(paste0(here::here(),"/data_challenge_updated.csv"), 
                       col_types = cols(ID = col_skip()))
cancer_dat = as.data.frame(cancer_dat)

## initial data treatment
cancer_dat$Treatment.response= as.factor(cancer_dat$Treatment.response)
cancer_dat$Sex = as.factor(cancer_dat$Sex)

PART I: TREATMENT PREDICTION

Treatment response was evaluated 3 months after treatment initiation. Clinicians’ question is whether it is possible to predict, before treatment initiation, if a given patient will or won’t respond to treatment.


DESCRIPTIVE ANALYSIS

In this part of the study, we get familiar with the dataset at hand. First we look at what the variables represent and how they are distributed. Then, we propose new features that could be pertinent for our future model(s). Finally, we start looking at the different variables’ 2 by 2 relationships; first focusing on connections between the target variable and the others, then observing the correlations between features.


Univariate Analysis

Question 1. “Perform a descriptive analysis of the data.”

The dataset involves 300 patients (rows) with the following outcome: Treatment.response (“No”: non-responder, “Yes”: responder).

The other columns depict individual prognostic features:

1) Demographics features:

  • Sex (1: Female, 2: Male)
  • Age (in years)
  • Weight (in kg)

2) Radiomics Features:

a) Morphological features describe the shape of the traced region of interest -ROI- (or lesion) and its geometric properties

  • Surface.cm2: the lesion’s computed surface (in cm2)
  • Volume.cm3: The lesion’s computed volume (in cm3)
  • Recist.cm: Value/criterion (in cm) derived from RECIST (Response Evaluation Criteria In Solid Tumors)

b) Intensity-based statistical features describe how intensities within the region of interest (ROI) are distributed (after investigation of the Image Biomarker Standardisation Initiative document, it is assumed that the variables with names ending in “.stat” correspond to intensity-based statistical features).

  • Number.of.grey.levels: Number of grey levels found in the image’s ROI
  • Min.value.stat: Minimum value of the voxels’ intensity distribution in the ROI
  • Max.value.stat: Maximum value of the voxels’ intensity distribution in the ROI
  • Median.value.stat: Median value of the voxels’ intensity distribution in the ROI
  • Skewness.stat: Skewness of the voxels’ intensity distribution in the ROI
  • Standard.deviation.stat: Standard deviation of the voxels’ intensity distribution in the ROI

c) Grey level co-occurrence based features include the so-called textural features, which are obtained by calculating the statistical inter-relationships between neighbouring voxels. Note: GLCM stands for Grey Level Co-occurrence Matrix.

  • Contrast.GLCM: Assesses grey level variations. Large differences in grey levels within the ROI increase the feature value.
  • Homogeneity.GLCM (or Inverse difference moment): Large differences in grey levels within the ROI lower the feature value. Maximal if all grey levels are the same.
TABLE_NUM = 0
### function to present tables in documents
show_table <- function(temp_sum, title= "", rownames = FALSE, tab_num = TABLE_NUM ) {
    TABLE_NUM <<- tab_num +1 # 2 "<" instead of one to create global variable inside functions

  DT::datatable(temp_sum, 
                caption = paste0('Table ',TABLE_NUM,". ", title),
                rownames = rownames, filter = "top",class = 'cell-border stripe', editable = TRUE, 
                extensions = 'Buttons', options = list(dom = 'rtip', #dom = 'lBfrtip',
                                                       #buttons = c('copy', 'csv', 'excel',  'print'),
                                                       autoWidth=TRUE, 
                                                       scrollX=TRUE,
                                                       lengthChange=TRUE,
                                                       pageLength=20))
}

### function to get summary of qualitative
get_qualitative_summary <- function(data) {
  categoricaldata=data.frame(matrix(ncol=5))
  columns = c("Variables","Variable_Levels","Frequency", "Percentage","Distinct_Values")
  colnames(categoricaldata)= columns
  for(i in 1:length(data)){
    if(is.factor(data[,i])&!all(is.na(data[,i]))){
      Summ=as.data.frame(summary(data[,i], maxsum = 20, stats = TRUE))
      Categories=rownames(Summ)
      Percentage=round((Summ*100)/sum(Summ),4)
      Variable=rep(colnames(data)[i],nrow(Summ))
      Distinct.Values=length(unique(data[,i]))
      table=cbind(Variable,Categories,Summ,Percentage,Distinct.Values)
      colnames(table)=columns
      categoricaldata=rbind(categoricaldata,table)
    }
  }
  categoricaldata=categoricaldata[-1,]
  return(categoricaldata)
}


### function to get summary of quantitative features
get_quantitative_summary <- function(data) {
  
  Quant_feats = colnames( select_if(data, is.numeric) )
  quant_sum = data.frame(Quant_feats, Min=NA,  Mean=NA, Median=NA,Max=NA,
                         Missing_Observations=NA)
  
  for( i in 1:length(Quant_feats)){
    quant_sum$Min[i] = min( data[,Quant_feats[i]], na.rm=T) 
    quant_sum$Mean[i] =  round(mean( data[,Quant_feats[i]], na.rm = T),3)
    quant_sum$Median[i] =  median(data[,Quant_feats[i]], na.rm = T)
    quant_sum$Max[i] =  max( data[,Quant_feats[i]], na.rm = T)
    quant_sum$Missing_Observations[i] =  sum(is.na(data[,Quant_feats[i]]))
    
  }
  return(quant_sum)
}

Summary Tables

Summary tables help us get familiar with the different variables’ distributions, and are a good way to spot missing values and outliers.

Find below a summary table of the numerical variables in the dataset:

# Get summary table output
quant_sum = get_quantitative_summary(cancer_dat)
show_table(quant_sum, title = "Quantitative Summary")
# Create new Weight variables, correct Recist and Number of grey levels variables
cancer_dat = cancer_dat %>% mutate(
  Missing.weight = ifelse(is.na(Weight), TRUE, FALSE),
  Number.of.grey.levels = ifelse(Number.of.grey.levels <=0, median(Number.of.grey.levels, na.rm=T),Number.of.grey.levels)
)

It gives us an initial overview of the numerical features, with their minimum and maximum values, along with their average and median values and the number of missing observations. We note that:

  • Weight variable has 200 missing values out of 300 observations, which is significant. This is problematic, as Weight is suspected to be an important variable for our predictive model.
    • First we need to understand whether these values are Missing At Random (MAR) or Not Missing At Random (NMAR). E.g., it is possible that this data is missing when patients refuse to provide it, which is NMAR and is information in itself. We create a Missing.weight boolean variable to capture this.
    • Then, we need to find a way to impute these missing values. This is something we cover in the Preprocessing step.
  • Recist.cm variable has 10 missing values (also covered in Preprocessing step)
  • Number.of.grey.levels has at least one negative value (see Min column above) which does not make sense for this variable. It looks like an entry error and it is hard to assume anything about what the real value could be. We replace it with the variable’s median.
  • The minimum weight is 37 kg, which is suspicious because there are no children in the data (the minimum age is 29 years).

Next, a summary table of the categorical variables in the dataset:

# Get summary table output
qual_sum = get_qualitative_summary(cancer_dat)
show_table(qual_sum, title = "Qualitative Summary")

There are only two qualitative variables. The table lists the different classes for these factors, along with their frequencies, percentages and numbers of distinct values. Missing values are treated as classes, hence we can see that there are none here.

We can observe that the target variable is imbalanced (the two Treatment response classes do not appear in similar proportions). This will have implications later during modelling.


Distribution Plots

## Function to plot simple distributions of quantitative features
get_distribution_hist = function(data, column){

  p=ggplot(data) +
    #"darkgreen" "forestgreen" '#08519c' 'dodgerblue4'
    geom_density(aes_string(x= column),
                 fill = 'forestgreen', alpha = 0.4) +
    geom_histogram(aes_string(x= column,y="..density.."),
                   colour= "black",fill= 'dodgerblue4',
                   alpha = 0.7,bins = 25) +
    geom_vline(xintercept= mean(data[,column], na.rm=T),size=0.8,linetype=5) +
    ggtitle(paste0("Distribution of ",column))+
    labs( x=column, y="Density")+
    scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
    theme(
      text = element_text(size=14,face="bold"),
      plot.title = element_text(hjust = 0.5,size=16),
      axis.text.x =element_text(hjust=1,vjust=0.9),
      plot.margin = unit(c(0.8,0.8,0.8,0.8), "cm"))

  return(p)
}

## Work in Progress
# get_distribution_hist = function(data, column){
#   
#   p=ggplot(data) +
#     #"darkgreen" "forestgreen" '#08519c' 'dodgerblue4'
#     geom_density(aes_string(x= column, y = "..count.."),
#                  fill = 'forestgreen', alpha = 0.4) +
#     geom_histogram(aes_string(x= column,y="..count.."),
#                    colour= "black",fill= 'dodgerblue4',
#                    alpha = 0.7,bins = 25) +
#     geom_vline(xintercept= mean(data[,column], na.rm=T),size=0.8,linetype=5) +
#     ggtitle(paste0("Distribution of ",column))+
#     labs( x=column, y="Count")+
#     scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
#     scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
#     theme(
#       text = element_text(size=14,face="bold"),
#       plot.title = element_text(hjust = 0.5,size=16),
#       axis.text.x =element_text(hjust=1,vjust=0.9),
#       plot.margin = unit(c(0.8,0.8,0.8,0.8), "cm"))
#   
#   return(p)
# }


## Function to plot simple distributions of quantitative features
get_distribution_bar = function(data, column){
  p=ggplot(data) +
    #"darkgreen" "forestgreen" '#08519c' 'dodgerblue4'
    geom_bar(aes_string(x= column),
             colour= "black",fill= 'dodgerblue4', alpha = 0.8) +
    
    ggtitle(paste0("Distribution of ",column))+
    labs( x=column, y="Frequency")+
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
    theme(
      text = element_text(size=14,face="bold"),
      plot.title = element_text(hjust = 0.5,size=16),
      axis.text.x =element_text(hjust=1,vjust=0.9),
      plot.margin = unit(c(0.8,0.8,0.8,0.8), "cm"))
  
  return(p)
}

# Save current quantitative variable names
Quant_feats = colnames(cancer_dat %>% select_if(is.numeric))
# Save current qualitative variable names
Quali_vars = colnames(cancer_dat %>% select_if(function(x) is.factor(x) | is.logical(x)))

## Save all resulting plots in list object
gg0 <- list()
for( i in Quant_feats){
  gg0[[i]] = get_distribution_hist(cancer_dat, i)
}
for( i in Quali_vars){
  gg0[[i]] = get_distribution_bar(cancer_dat, i)
  
}

To get a better understanding of how the variables are distributed, we plot them in different tabs:

# Save current variable names
Variables = colnames(cancer_dat)

# Display variable distribution plots in different tabs
for (Var in Variables) {
  cat("##### ",Var,"\n")
  print(gg0[[Var]])
  cat('\n\n')
}
[Distribution plots are displayed in tabs, one per variable: Treatment.response, Sex, Age, Weight, Recist.cm, Surface.cm2, Volume.cm3, Number.of.grey.levels, Min.value.stat, Max.value.stat, Median.value.stat, Skewness.stat, Standard.deviation.stat, Contrast.GLCM, Homogeneity.GLCM, Missing.weight.]

Barplots are used to display categorical distributions, and histograms and juxtaposed density plots are used for numerical distributions.

We can observe that :

  • Several of the features are right-skewed, in particular Surface.cm2, Volume.cm3, Max.value.stat and Contrast.GLCM.
  • On the other hand, Number.of.grey.levels looks more normally distributed.
  • Regarding Weight, 37 kg does not look like an outlier after all; however, on the other side of the distribution, 116 kg does (the second highest value is 96 kg).
  • Age also seems to display an outlier, with the lowest value being 29 years old when the second youngest patient is 41.
  • Some radiomic features like Surface and Volume also seem to display outliers; however, it is hard to tell for sure at this stage due to their pronounced skewness.

In both the Weight and Age cases, we have no reason to believe these outliers are due to wrong data entry, hence replacing them with the median value is not appropriate. Also, we do not wish to exclude them as the amount of data available is limited. We decide to use Winsorisation on these variables as it is a good way to make them less influential while preserving information.

## apply Winsorisation to the Weight and Age variables
cancer_dat = cancer_dat %>% mutate(
  Weight = DescTools::Winsorize(Weight,probs=c(0.00,0.99), na.rm=T, type = 1),
  Age = DescTools::Winsorize(Age,probs=c(0.01,1), type = 1)
)

Producing New Variables

Given the number of observations in our data, and the need for explainability of the model, we will not evaluate all 2 by 2 interactions during the modelling step: this would reduce the number of degrees of freedom, affect the robustness of our models, and also create high correlations between variables, which can make the assessment of feature importance difficult if not impossible due to multicollinearity. However, we can ‘manually’ derive new biomarkers if we believe they can be pertinent additions to our model. Hence, we proceed to create the following radiomics features:

  • Surface to Volume ratio (cm-1)
  • Compactness 1 which is chosen arbitrarily among several similar variables to quantify the deviation of lesion’s volume from a representative spheroid
  • Intensity Range (the intensity interquartile range would probably be a more appropriate measure, as it is less affected by outliers, but intensity percentiles are not available)
## Add new variables
cancer_dat = cancer_dat %>% mutate(Surface.volume.ratio = Surface.cm2/Volume.cm3,
                                   Compactness.1 = Volume.cm3/(pi^(1/2)*Surface.cm2^(3/2)),
                                   Intensity.range = Max.value.stat - Min.value.stat)

## update quantitative and qualitative variable names
Quant_feats = colnames(cancer_dat %>% select_if(is.numeric))
Quali_vars = colnames(cancer_dat %>% select_if(function(x) is.factor(x) | is.logical(x)))
Features = colnames(cancer_dat)[colnames(cancer_dat)!="Treatment.response"]

The distribution of these new features is displayed in the plots below:

# Save current variable names
Variables = c("Surface.volume.ratio","Compactness.1","Intensity.range")

# Save plots in list object
gg0 <- list()
for( i in Variables){
  gg0[[i]] = get_distribution_hist(cancer_dat, i)
}

# Display variable distribution plots in different tabs
for (Var in Variables) {
  cat("##### ",Var,"\n")
  print(gg0[[Var]])
  cat('\n\n')
}
[Distribution plots displayed in tabs: Surface.volume.ratio, Compactness.1, Intensity.range.]

Again we observe that these variables display a right-skewed distribution.


Bivariate Analysis

In this step of the analysis we examine the 2 by 2 relationships of the different variables in the dataset.


Between Target and Explanatory Variables

Question 2. “Is there any difference between non-responder and responder patients according to their pre-treatment individual features?”

First, we focus on the connections between the dependent variable (Treatment.response) and each independent variable.

These relationships are plotted in the tabs below, using boxplot and density visualisations for numeric features and barplots for categorical features.

## Function to plot boxplot distributions of quantitative features relative to categorical response
get_distribution_boxplot = function(data, column, cat_response){
  
  # boxplot code
  p=ggplot(data,aes_string(x=cat_response, y=column, fill = cat_response)) +
    #"darkgreen" "forestgreen" '#08519c' 'dodgerblue4'
    geom_violin( alpha = 0.3 ) +
    geom_boxplot(alpha = 0.7) +
    labs( x=cat_response, y=column, title="")+
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
    scale_fill_manual(values=c('hotpink3', 'dodgerblue4')) +
    theme(
      legend.position="none",
      axis.title.x = element_blank(),
      text = element_text(size=14,face="bold"),
      # axis.text.x = element_blank()
    )+
    coord_flip()
  
  # density plot code
  p2=ggplot(data,aes_string(x=column, fill = cat_response)) +
    geom_density(position = "fill", alpha = 0.8)+
    labs( x=column, y="Density", title ="") +
    scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
    scale_fill_manual(values=c('hotpink3', 'dodgerblue4')) +
    theme(
      legend.position="bottom",
      text = element_text(size=14,face="bold")
    )
  # title plot
  title = paste0("Distribution of ",column," for \nDifferent Levels of ",cat_response)
  
  return(list(title,p,p2))
}

get_bivariate_barplot = function(data, column, cat_response){
  
  # barplot code
  p= ggplot(data,aes_string(x=column, fill = cat_response)) +
  geom_bar(position = "dodge",  colour="black",alpha = 0.8) +
  labs( y="Count") +
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  scale_fill_manual(values=c('hotpink3', 'dodgerblue4')) +
  theme(
      legend.position="none",
      # axis.title.x = element_blank(),
      text = element_text(size=14,face="bold"),
      # axis.text.x = element_blank()
    ) + coord_flip()
  
  # colplot code
  
  proportions <- data %>%
    group_by(.dots = c(column, cat_response)) %>%
    summarise(Count = n())%>%
    group_by(.dots = column) %>%
    mutate(Proportion = Count / sum(Count))
  
  
  p2=ggplot(proportions,aes_string(x=column, y="Proportion", fill = cat_response)) +
  geom_col(colour="black", alpha=0.8) + ## geom_col already uses the "identity" stat
  scale_y_continuous(labels = scales::label_percent(accuracy = 1), #percent
                     breaks = scales::pretty_breaks(n = 10)) +
  scale_fill_manual(values=c('hotpink3', 'dodgerblue4')) +
  theme(
      text = element_text(size=14,face="bold"),
      legend.position="bottom",
      # legend.title=element_text(size=8),
      # legend.text=element_text(size=8)
    ) + coord_flip()
  

  # title plot
  title = paste0("Distribution of ",column," by ",cat_response)
  
  return(list(title,p,p2))
}

# Get plots of quantitative and qualitative features in list object
gg0 <- list()
for( i in Features){
  if(i %in% Quant_feats){
      gg0[[i]] = get_distribution_boxplot(cancer_dat, i, "Treatment.response")
  } else if(i %in% Quali_vars){
          gg0[[i]] = get_bivariate_barplot(cancer_dat, i, "Treatment.response")
  }
}

# display plots in tabs
for (Feature in Features) {
  cat("##### ",Feature,"\n")
  grid.newpage()
  
  if(Feature %in% Quant_feats){
    pushViewport(viewport(layout = grid.layout(3, 1, heights = unit(c(2, 10, 13), "null"))))
     grid.text(gg0[[Feature]][[1]],gp = gpar(fontsize = 16, fontface="bold"), vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
     print(gg0[[Feature]][[2]], vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
     print(gg0[[Feature]][[3]], vp = viewport(layout.pos.row = 3, layout.pos.col = 1))
 
  } else if(Feature %in% Quali_vars){
    pushViewport(viewport(layout = grid.layout(3, 1, heights = unit(c(2, 10, 10), "null"))))
     grid.text(gg0[[Feature]][[1]],gp = gpar(fontsize = 16, fontface="bold"), vp = viewport(layout.pos.row = 1))
     print(gg0[[Feature]][[2]], vp = viewport(layout.pos.row = 2, layout.pos.col = 1))
     print(gg0[[Feature]][[3]], vp = viewport(layout.pos.row = 3, layout.pos.col = 1))
  }
  cat('\n\n')
}
[Bivariate distribution plots displayed in tabs, one per feature: Sex, Age, Weight, Recist.cm, Surface.cm2, Volume.cm3, Number.of.grey.levels, Min.value.stat, Max.value.stat, Median.value.stat, Skewness.stat, Standard.deviation.stat, Contrast.GLCM, Homogeneity.GLCM, Missing.weight, Surface.volume.ratio, Compactness.1, Intensity.range.]

These graphs represent the distribution of our features for different values of the categorical variable Treatment.response. Some correlations can be inferred just by looking at the plots. For example, in the Weight tab, it is clear that the distribution of patients’ weight differs depending on the treatment response: on average, a negative response is associated with a higher weight. One could suspect that being overweight has a negative impact on treatment response (note that the Weight plot only displays available values, which means its visualisation is based on 100 observations instead of 300).
Another thing to note is that the distribution of Age is very similar for both groups, which is surprising, as we could easily expect it to have a (likely negative) effect on the treatment response. It is possible that some frequency matching was applied when designing the data sample in order to avoid a potential confounding effect.

These plots give a good visual intuition of the relationships. However, we need to verify the existence of these correlations statistically. Different tests exist to compare quantitative as well as qualitative features between two population groups (Responders and Non-responders). We saw during the univariate analysis that not all numeric variables display a normal distribution, hence we favour non-parametric tests because they do not require distributional assumptions and can be applied to all features:

  • We use the Mann-Whitney U Test (Wilcoxon Rank-Sum Test) for quantitative features
  • We use the Chi-squared test (applied to a contingency table) for qualitative features

First, we test the numeric variable differences between the Responder and Non-responder groups. The hypotheses are:

  • Under the null hypothesis H0: the distributions of both populations are identical
  • The alternative hypothesis H1: the distributions are not identical.

The following assumptions are met in order to run a Mann-Whitney U test:

  • Treatment groups are independent of one another. Experimental units only receive one treatment and they do not overlap.
  • The variable of interest is ordinal or continuous.
  • Both samples are random.

We run the test on all numeric variables and get the following results:

test_sum = data.frame(Quant_feats,p_value=NA)

for(i in 1:length(Quant_feats)){
  f = as.formula(paste(Quant_feats[i], "Treatment.response", sep= "~"))
  res <- wilcox.test(f, data = cancer_dat)
  test_sum$p_value[i] = round(res$p.value,6)
}

test_sum = test_sum %>% mutate(
  at_5perc_level = ifelse(p_value<0.05,"Reject H0", "Cannot reject H0"),
    at_20perc_level = ifelse(p_value<0.20,"Reject H0", "Cannot reject H0")
)

show_table(test_sum, title = "Wilcoxon Rank-Sum Test")

Interpretation: Working with a significance level (alpha) of 5%, we consider that for features with p-values under 0.05, we can reject H0 with a confidence level of 95% and accept the alternative H1: there is a statistically significant difference between the Responder and Non-responder groups.

For example, the Volume feature has a p-value of 0.001, hence we reject the null hypothesis at the 5% level.

To observe the relationship between the Sex (categorical) feature and our dependent variable, we produce a contingency table: Treatment response - Sex:

# Create contingency table
cont_table=table(cancer_dat$Treatment.response, cancer_dat$Sex)

#format for display
colnames(cont_table) = c(paste0("Sex.",colnames(cont_table)))
rownames(cont_table) = c(paste0("Treatment.response.",rownames(cont_table)))
cont_table = as.data.frame.matrix(cont_table)

# display table
show_table(cont_table, rownames=TRUE, title = "Contingency Table: Sex ~ Treatment Response")

Just observing the table, we can guess that there is a correlation between the two variables, as the Sex distribution is very different whether we look at the positive or negative Treatment response. However to be sure, we run a Chi-squared test where the null hypothesis is H0: Row and column variables are independent.

The following assumptions are met in order to run a Chi-squared test:

  • Independence of Observations
  • Cell Frequencies greater or equal to 5
  • Random Sampling
# run chi square test
chisq.test(cancer_dat$Treatment.response, cancer_dat$Sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cancer_dat$Treatment.response and cancer_dat$Sex
## X-squared = 39.388, df = 1, p-value = 3.474e-10

The p-value is below 0.0001, hence we can reject the null hypothesis at the 5% level: Treatment response and the patient’s Sex are correlated.

We repeat the same with the Missing.weight variable:

# Create contingency table
cont_table=table(cancer_dat$Treatment.response, cancer_dat$Missing.weight)

#format for display
colnames(cont_table) = c(paste0("Mising.weight.",colnames(cont_table)))
rownames(cont_table) = c(paste0("Treatment.response.",rownames(cont_table)))
cont_table = as.data.frame.matrix(cont_table)

# display table
show_table(cont_table, rownames=TRUE, "Contingency Table: Missing Weight ~ Treatment Response")
# run chi square test
chisq.test(cancer_dat$Treatment.response, cancer_dat$Missing.weight)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  cancer_dat$Treatment.response and cancer_dat$Missing.weight
## X-squared = 1.4417, df = 1, p-value = 0.2299

The p-value is above 0.20, so we cannot reject H0 at the 5% or even the 20% level. This means we do not have enough evidence to conclude that the missing Weight values are NMAR, and therefore that we can impute the missing weight values normally during the preprocessing step.


Between Features

In this work, there is a need for explainability of the model and of the impact of its features on the predictions; hence, it is important to study the relationships between the different input features in order to identify potentially high correlations, which could lead to multicollinearity in the models and make their interpretation difficult.

We solely focus on features for which we accepted the alternative hypothesis H1 of significant difference between Treatment response’s groups at the 20% level (instead of the usual 5%, to make sure we do not miss out on important features).

To analyse these connections, we need an appropriate correlation metric. Most of the features are numerical, except for Sex which is binary, hence metrics such as Pearson and Spearman are suited for this application. We opt for the Spearman correlation coefficient as it is rank-based and can capture non-linear (monotonic) relationships between variables.

We compute the 2 by 2 correlations for all factors and save them in the following matrix:

### Convert factor to dummies 
cancer_dat$Sex = ifelse(as.character(cancer_dat$Sex) == "1", 1, 0) # 1: Female, 2: Male
cancer_dat= cancer_dat %>% rename(  Sex.female = Sex)
Quali_vars = c("Treatment.response","Sex.female")

## List significant features for pre-selection
Sig_features = Features[Features %in% test_sum$Quant_feats[test_sum$p_value<0.20]]
Sig_features = c("Sex.female", Sig_features)


# Setup correlation function
get_correlation_matrix = function(data){
  
  data = data %>% select_if(is.numeric)
  cormat <- round(cor(data, method = "spearman", use="pairwise.complete.obs"),2)
  melted_cormat <- melt(cormat)
  melted_cormat$p.value = NA
  for(i in 1:nrow(melted_cormat)){
    # index columns by name (as.character) rather than by factor codes
    res = cor.test(data[, as.character(melted_cormat$Var1[i])],
                   data[, as.character(melted_cormat$Var2[i])],
                   method = "spearman")
    melted_cormat$p.value[i] = round(res$p.value,4)
  }
  return(melted_cormat)
}

# Set up correlation plot function
plot_correlation_matrix= function(melted_cormat){
  p= ggplot(data = melted_cormat, aes(Var2, Var1, fill = value))+
    geom_tile(color = "white")+
    geom_text(aes(label = round(value, 2)), size=2.5) +
    scale_fill_gradient2(low = 'hotpink3' , high ='dodgerblue4', mid = "white",
                         midpoint = 0, limit = c(-1,1), space = "Lab",
                         name="Spearman\nCorrelation") +
    labs(title = "Spearman Correlation Matrix",
         x= "Features", y= "Features") +
    theme(
      text = element_text(size=10,face="bold"),
      plot.title = element_text(hjust = 0.5,size=16),
      axis.text.x =element_text(hjust=1,vjust=0.9, angle=45),
      axis.title.x = element_text(size=14),
      axis.title.y = element_text(size=14),
      plot.margin = unit(c(0.8,0.8,0.8,0.8), "cm"))
  
  return(p)
}

#Get correlation matrix and display plot
mat = get_correlation_matrix(cancer_dat %>% select(Sig_features))
plot_correlation_matrix(mat)

There is a lot of information in this visualisation. White cells indicate little correlation between the variables, blue cells indicate strong positive correlations and pink cells indicate strong negative correlations.

We observe that:

  • Surface.cm2 and Volume.cm3 (and also Number.of.grey.levels) are:
    • highly correlated with each other. As a consequence, Surface.volume.ratio is also strongly (negatively) correlated with both of them. To avoid multicollinearity, we should make sure that only one of these 3 features is part of the final model.
    • also moderately correlated with other radiomic features; we need to monitor multicollinearity between these in the final model, for example by evaluating the Variance Inflation Factor (VIF). More on this later.
  • Weight and Sex barely show any correlation with the other variables.

MODELING


Algorithm Selection

Question 3. “Propose and implement several algorithms to deal with the primary objective. You should motivate the choice of the models and evaluate their predictive performances (argue the choice of the model validation technique, model tuning parameters, metrics, etc.).”

In machine learning, there is something called the “No Free Lunch” theorem: No one algorithm works well for every situation. Hence, there is a need for several different models to be trained and tested to evaluate which one is the best for our specific problem.

Here, we focus on supervised learning, more specifically (binary) classification, as the response variable is categorical, with 2 possible outcomes (“Yes” or “No”).

Below is a non exhaustive list of the main binary classification algorithms:

  • Decision Trees
  • Logistic Regression
  • K-Nearest Neighbors (KNN)
  • Random Forest
  • Gradient Boosting
  • Support Vector Machine (SVM)
  • Neural Networks (NNs)

To choose which ones are most appropriate for our work, there are several factors to take into account:

  • Data sample size vs. Model complexity
  • Degree of Explainability
  • Performance

We are working with a dataset of 300 observations (patients), which is considered small from a machine learning point of view. Most of the algorithms mentioned above are too complex to be trained with so few data points; this would lead to overfitting the training set and the resulting model would not be robust. Hence, we focus on classifiers that need no or little hyperparameter tuning. We also need to use algorithms that allow for interpretation, as an assessment of the variables’ impact on the predictions is required.

We review the different candidates:

  • NNs: These are deep learning models and require too many observations to be considered. - Rejected

  • SVM: This algorithm is adapted to higher-dimensional problems, and hyperparameter tuning and the choice of kernel function must be done carefully. Training such a model requires more data than we have. - Rejected

  • Gradient Boosting: This model has a similar problem: There are many hyperparameters to tune. - Rejected

  • Random Forest: This algorithm only needs a manageable amount of hyperparameter tuning. Moreover, the problem of overfitting is limited as Random Forest considers only a random subset of features at each split, and the final outcome depends on all the trees; this process generalises well. Finally, this model can be used to extract feature importance, which is a good indicator of how the variables impact the predictions. - Selected

  • KNN: This algorithm is adapted to smaller datasets; it has only one hyperparameter to tune (K). There are many prerequisites to fulfil before running it (dealing with outliers, scaling and balancing the data, handling missing values, etc.) but this is something we can take care of during the pre-processing phase. Unfortunately, there is no way for us to interpret the impact of the features on the predictions with this model. - Rejected

  • Logistic regression: This is a relatively simple model, and it is well suited to small data samples; there are no hyperparameters to tune (actually, there is one if we use the LASSO; more details below). Moreover, it has a high degree of explainability as the variables’ coefficients can be interpreted. - Selected

  • Decision Tree: Although there is no complexity or explainability issue with this model, the algorithm is too “simple” and is expected to yield bad results as compared to the other models we chose. - Rejected

To summarise, the models selected for further work are the Logistic Regression and the Random Forest.


Pre-Processing

In the preprocessing phase, we first proceed to a preliminary feature selection, based on the previous analysis. Then, we split the dataset between training and test sets: the first is used to tune the hyperparameters and train the models, the second is left out until the end and is used to evaluate the models’ performance on new, unseen data. After this, we handle potential outliers and impute missing values. Finally, we can apply some variable transformations (e.g., Box-Cox or regular standardisation).


-Preliminary- Feature Selection

Both of the algorithms selected for modelling are particularly sensitive to irrelevant independent variables in the model; a careful feature selection is necessary for these algorithms to perform well. Hence, as a preliminary feature selection step, we exclude the factors with no relationship with the target variable from the list of input features of the future models.

Looking back at the significance tests from the bivariate analysis, we keep only variables with a statistically significant difference at the 20% level. This level can be considered quite high, however we adopt a risk-averse approach: some relationships might become evident once independent variables are together in the model, depending on the algorithm used. Random Forest in particular will take into account a certain degree of interaction between the different inputs and could reveal the usefulness of some of these variables.

Here are the remaining numerical input variables after filtering:

## take out all numeric features with p_value > 0.20 
alpha = .20
#Display updated table
show_table(test_sum[test_sum$p_value < alpha,], title = "Pre-selected Features")
## Get data with pre-selected features only
model_dat = cancer_dat[,c("Treatment.response",Sig_features)]

Regarding our categorical feature Sex, we observed that the variable had a significant relationship with our target response (based on chi-squared test), hence it is kept in the list of model inputs.

Note: This process is technically incorrect. All of the descriptive analysis, including the computation of these p-values, was done using the entire dataset. By doing so, we use the complete information available to us to make decisions about the model, whereas only information from the training set should be available at this stage, in order not to “contaminate” the test set, which is supposed to remain untouched until the evaluation of prediction performance. Given the small number of data points, we decided to proceed like this regardless in order to have a good initial overview of the data. For this work, the upward performance bias it may cause is considered negligible.


Training/Test Split

In this step, the patients are divided into two groups: training and test sets. 70% of observations go to the training set and the remaining 30% go to the test set. We apply a stratified random split based on Treatment.response and Sex in order to keep the positive/negative response and Male/Female proportions the same from one group to the other.

## Divide data into training and test 70/30
set.seed(3456)
trainIndex <- createDataPartition(paste0(model_dat$Treatment.response,model_dat$Sex.female), 
                                  p = .7, 
                                  list = FALSE, 
                                  times = 1)
train_dat<-model_dat[trainIndex,]
test_dat<-model_dat[-trainIndex,]

Handling Missing Values

Early in the analysis, we observed that the Weight variable contains 200 missing values out of 300 observations, which is substantial. In spite of this, the bivariate analysis showed that it is still one of the features with the strongest relationship to Treatment.response (when observations are present). We do not want to lose this information; hence we proceed to impute the missing values.

Our initial thought was to use Regression (or Bagged) Imputation using Age and Sex as explanatory variables. The algorithm trained on the training data could then be used to estimate the training and test sets’ missing values. However, we saw in the Spearman correlation matrix displayed earlier that, in this data, Weight is (surprisingly) barely correlated with any other demographic feature, so there is little added value in this extra work. We therefore decide to use the training set’s median weight value to impute all missing values (training and test) of the feature. This method is appreciated because it is robust to outliers.
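
For reference, had we gone down the bagged-imputation route, a minimal sketch using caret’s preProcess (method “bagImpute”, which relies on the ipred package) could look like the following. It is not run here, and it assumes the demographic covariates kept in train_dat (Age, for example, may already have been dropped during pre-selection):

## Sketch (not run): bagged-tree imputation of Weight fitted on the training set only
# impute_vars   <- c("Weight", "Sex.female")  # add Age here if it were still available
# impute_object <- preProcess(train_dat[, impute_vars], method = "bagImpute")
# train_dat[, impute_vars] <- predict(impute_object, train_dat[, impute_vars])
# test_dat[, impute_vars]  <- predict(impute_object, test_dat[, impute_vars])  # reuse the training fit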

The Recist.cm feature also displays missing values; however, it is disregarded for model training (p-value > 0.20), so there is no need for imputation.

## Get median value in training set
weight_median_train = median(train_dat$Weight, na.rm=T)

# impute training set
train_dat = train_dat %>% mutate(
  Weight = ifelse(is.na(Weight),weight_median_train, Weight )
)

# Impute test set
test_dat = test_dat %>% mutate(
  Weight = ifelse(is.na(Weight),weight_median_train, Weight )
)

Handling Outliers

The most obvious outliers in Number.of.grey.levels, Weight and Age variables have already been treated in previous steps.

There are some extreme values in the radiomic features, however they do not appear to be due to accidents in measurement or data entry. Moreover, such values can be expected in skewed distributions such as the ones we are working with (the Surface.cm2 and Volume.cm3 variables for example). We could apply transformations to skewed variables to make them more “Gaussian-distributed” (such as the Box-Cox transformation, Yeo-Johnson transformation, etc.) and then see if these observations still appear to be outliers. However, note that for the algorithms we chose, namely Logistic Regression and Random Forest, there is no normality assumption on the input variables’ distributions. Hence we can keep the data as is and work with skewed variables; this option can be explored later.

We decide not to treat outliers further for now.

### Apply automated transformations:
# # initialise list
# BNobjects = list()
# 
# ## Set up empty dataframe for transformed training set
# stand_train_dat = train_dat
# stand_train_dat[TRUE,]=NA
# stand_train_dat[,Quali_vars] = train_dat[,Quali_vars]
# 
# ## Set up empty dataframe for transformed test set
# stand_test_dat = test_dat
# stand_test_dat[TRUE,]=NA
# stand_test_dat[,Quali_vars] = test_dat[,Quali_vars]
# 
# ## populate transformed datasets
# for (i in 1:length(Sig_features)){
#   if( Sig_features[i] %in% Quant_feats){
#     
#     BNobjects[[Sig_features[i]]] <- bestNormalize(train_dat[,Sig_features[i]])
#     
#     stand_train_dat[,Sig_features[i]] = predict(BNobjects[[Sig_features[i]]], newdata =train_dat[,Sig_features[i]])
#     
#     stand_test_dat[,Sig_features[i]] = predict(BNobjects[[Sig_features[i]]], newdata =test_dat[,Sig_features[i]])
#   }
# }
# ## Not all variables are standardised, need to make sure there are  all standardised during imputation for lasso to work
# 
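
As a lighter alternative to the commented-out bestNormalize approach above, the skewed numeric inputs could also be given a Yeo-Johnson transformation fitted on the training set only, for example through caret. This is only a sketch under that assumption and is not run here:

## Sketch (not run): Yeo-Johnson transformation of the numeric columns, fitted on training data only
# yj_object <- preProcess(train_dat, method = "YeoJohnson")
# train_yj  <- predict(yj_object, train_dat)  # transform the training set
# test_yj   <- predict(yj_object, test_dat)   # reuse the training parameters on the test set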

Variable Transformation

Although no normality assumption is necessary for the models we chose to train, the LASSO algorithm used with the Logistic Regression (more details in the Modelling step) is optimised when the independent variables are centered and scaled. Consequently, we proceed to standardising the data before modelling.

### standardising data:
processing_object <- preProcess(train_dat, method = c("center", "scale"))
# Training
stand_train_dat <- predict(processing_object, train_dat)
# test using training parameters
stand_test_dat <- predict(processing_object, test_dat)

Fitting Models

In this section we train two candidate models: a Logistic Regression algorithm and a Random Forest algorithm. In order to do this, a metric and a validation technique need to be chosen for optimisation.

Accuracy is a very common measure for classification problems, however it is not suited to imbalanced target variables. Indeed, optimising this metric can lead to models over-predicting the majority class. Given we were not instructed to apply different penalties to False Negative and False Positive classifications, we consider that these are equally undesirable and disregard Accuracy as a potential optimiser. The Area Under the ROC Curve (based on Sensitivity and Specificity) is a better option, however it can also be biased. Ideally, we need an optimiser that focuses on maximising both Sensitivity (or Recall) and Precision at the same time. The F1 score does just this, as it is the harmonic mean of Precision and Recall. Hence, we use this metric to optimise the training of our models.
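For reference, F1 = 2 × (Precision × Recall) / (Precision + Recall), so a high F1 score requires Precision and Recall to both be high at the same time.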

Whenever tuning hyperparameters, we use 10-fold cross-validation: during modelling, the training set is divided into 10 random folds, and 10 models are trained the same way on 9 of these folds with their performance evaluated on the remaining fold (the held-out validation fold). The 10 performances are averaged to provide a more robust estimate of performance on new data than a single training/validation split, which is subject to variation. This operation is repeated for different hyperparameter values, and the optimal ones are chosen from the model that displayed the best cross-validated performance metric (here the F1 score). Then, when relevant, the model is retrained on the entire training set and saved for later steps.

## Set F1 score:
f1 <- function(data, lev = NULL, model = NULL) {
  ## lev[1] is the first factor level of the outcome ("No" here), used as the positive class for F1
  f1_val <- MLmetrics::F1_Score(y_pred = data$pred, y_true = data$obs, positive = lev[1])
  c(F1 = f1_val)
}

## Set 10 fold cross validation control:
fitControl <- trainControl(## 10-fold CV
  method = "cv", #repeatedcv",
  number = 10,
    # repeats = 10, ## Optional, can go back to method= "cv"
  ## See summaryFunction & selectionFunction to set optimisation parameters yourself
  # summaryFunction = twoClassSummary, ## optimises with ROC, displays ROC, sensitivity and specificity
  summaryFunction = f1, # Optimise with f1 score
  classProbs = TRUE, ## This goes along with line just before.... 
  savePredictions = TRUE
)

Before training the Logistic Regression algorithm, we still need to make sure that all input variables are essential to the model, as its performance can be affected by irrelevant input variables, and its interpretability can be compromised in the presence of multicollinearity. The LASSO algorithm can help us solve this issue.


Feature Selection with LASSO

The Least Absolute Shrinkage and Selection Operator (LASSO) performs L1 regularisation, which adds a penalty equal to the absolute value of the magnitude of the coefficients. This type of regression is well suited to models showing high levels of multicollinearity and to automatic variable selection. We expect this model to show us which variables to exclude in order to get rid of irrelevant or redundant factors while keeping prediction performance optimal.
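Concretely, for the logistic model the LASSO coefficients are obtained by minimising the penalised negative log-likelihood, -logLik(beta) + lambda * sum(|beta_j|): as lambda increases, more coefficients are shrunk exactly to zero, which is what performs the variable selection.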

The penalty is called Lambda and is a hyperparameter of the model that needs to be tuned. In order to do so, we run 200 iterations of 10-fold cross validation modeling in a loop, with Lambda taking different values between 0.00001 and 0.1. We select the Lambda value that optimises the F1 score:

lambda <- 10^seq(-5,-1, length = 200)
set.seed(2458) 
lreg_lasso = train(Treatment.response~ .,
                   data=stand_train_dat, method = 'glmnet',
                   tuneGrid = expand.grid(alpha = 1, lambda= lambda), ## 1 for Lasso, 0 for Ridge
                   #standardize = FALSE, ### for cv.glmnet() only
                   metric = "F1",
                   trControl=fitControl) 
## Get best lambda
best_lambda <- lreg_lasso$bestTune$lambda
# Find the row index where the best lambda is in the results
best_lambda_index <- which(lreg_lasso$results$lambda == best_lambda)
# Access the F1 score corresponding to the best lambda
best_f1_score <- lreg_lasso$results$F1[best_lambda_index]

cat("Best Lambda:", best_lambda, "\n")
## Best Lambda: 0.09115888
cat("Best F1 Score:", best_f1_score, "\n")
## Best F1 Score: 0.7689164

The input variables’ coefficients from the model obtained with this Lambda value are the following:

# Model coefficients
coef(lreg_lasso$finalModel, best_lambda)
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                   1
## (Intercept)             -0.34537229
## Sex.female               0.39313187
## Weight                  -0.21933874
## Surface.cm2             -0.08656735
## Volume.cm3               .         
## Number.of.grey.levels    .         
## Min.value.stat           .         
## Skewness.stat            .         
## Standard.deviation.stat  .         
## Homogeneity.GLCM         .         
## Surface.volume.ratio     .         
## Compactness.1            .

Only Sex, Weight and Surface have non-zero coefficients. The LASSO process evaluated that they were the most relevant variables for the model. The coefficients for Volume.cm3 and Surface.volume.ratio are now zero, which is in agreement with what we observed in the correlation matrix earlier: these factors, along with Surface.cm2, were highly correlated and likely to produce multicollinearity. The LASSO algorithm identified this issue and kept only the best input variables for prediction, excluding the redundant ones.

We save the new list of input variables; they are our new candidates for the Logistic Regression inputs.

### Get variables retained for following model
coefs=as.data.frame(as.matrix(coef(lreg_lasso$finalModel, lreg_lasso$bestTune$lambda)))
coefs$Variables = row.names(coefs)
coefs =  coefs[coefs[,1]!=0,]
lasso_features = coefs$Variables[-1] ## Without intercept
rm(coefs)

Logistic Regression

The Logistic Regression model is trained on the new list of input variables using the entire training set. There is no longer a defined metric to optimise (other than the maximised log-likelihood, of course) because there are no more hyperparameters to tune.

## Get only variables retained by LASSO
stand_train_small_dat = stand_train_dat[,colnames(stand_train_dat) %in% c("Treatment.response",lasso_features)]
## run Logistic regression 
set.seed(3456)
lreg<-train(Treatment.response~ .
            ,data=stand_train_small_dat,method="glm",
            family="binomial",
            # metric="F1", ## useless, no hyperparameter left to train, classic max loglikelihood optimised
            trControl=  trainControl(summaryFunction = f1)  # Does not Optimise with f1 score but displays it
 # Cross validation not necessary anymore
            )

# summary(lreg)

We will look at the training output a bit later in the process. However, see below the optimal cross-validation F1 Score obtained:

lreg$results
##   parameter        F1       F1SD
## 1      none 0.7318118 0.04489895

To make sure there is no collinearity left in the model, we use the Variance Inflation Factors (VIF) from the Logistic Regression. Each input feature has one; it quantifies the degree of multicollinearity of that variable with the other inputs. As a rule of thumb, a value under 5 is good (very little collinearity), while a value above 10 indicates a collinearity problem in the model.

Find the VIFs for every feature below:

car::vif(lreg$finalModel)
##  Sex.female      Weight Surface.cm2 
##    1.126688    1.121123    1.038262

All variables have a VIF under 5, which indicates that there is no multicollinearity issue in the model. This confirms what we saw in the correlation matrix, as Sex and Weight were barely correlated with any other variable (Surface.cm2 was correlated with other features, but none of them is present in the model).

This model is now a good candidate; we save it for later comparison, using test data for performances evaluation.


Random Forest

The second candidate is Random Forest. As it is also sensitive to irrelevant or redundant features, we use the knowledge acquired thanks to the LASSO algorithm and choose the same input variables selected for the Logistic Regression model. (Note: this choice is debatable; the two algorithms work differently and features discarded by the LASSO could still prove useful in a Random Forest.)

Given the size of the data, we favour small trees for training, with few variables involved and a relatively large minimum node size, in order to avoid overfitting with complex trees. The exact values of these hyperparameters, along with the splitting rule (Gini impurity or an Extremely Randomized Trees implementation), are defined by tuning with 10-fold cross-validation. Moreover, we train with a high number of trees to compensate and ensure good performance (here a fixed number of 2,000 trees).

Again, F1 Score is used as metric to optimise. We get the following results:

# define grid for rf
ranger_grid <- expand.grid(mtry = c(1, 2, 3), #number of variables chosen randomly per tree
                           splitrule = c("gini", "extratrees"),
                           min.node.size = c( 3,4, 5))

### HOW DO I know how many trees to train ? I don't want to overfit, need enough degrees of freedom
set.seed(3456)
ranger_fit <- train(Treatment.response~ .,
                    data=stand_train_small_dat, method = 'ranger',
                    tuneGrid = ranger_grid,
                    num.trees = 2000,
                    metric= "F1",
                    importance = 'permutation', # "impurity"),
                    trControl=fitControl)

ranger_fit$finalModel
## Ranger result
## 
## Call:
##  ranger::ranger(dependent.variable.name = ".outcome", data = x,      mtry = min(param$mtry, ncol(x)), min.node.size = param$min.node.size,      splitrule = as.character(param$splitrule), write.forest = TRUE,      probability = classProbs, ...) 
## 
## Type:                             Probability estimation 
## Number of trees:                  2000 
## Sample size:                      212 
## Number of independent variables:  3 
## Mtry:                             3 
## Target node size:                 3 
## Variable importance mode:         permutation 
## Splitrule:                        gini 
## OOB prediction error (Brier s.):  0.1913152
# ranger_fit ## corresponding F1 is in
# ranger_fit$results

Our final Random Forest model draws 3 candidate variables at random at each split in each of its 2,000 trees, the minimum node size is 3 data points and the splitting rule used is Gini impurity. The corresponding cross-validation F1 score is 0.7892.

Our second candidate is ready for performance evaluation and comparison.


Models Comparison

Question 4. “Compare these models and argue the choice of the best model.”

To evaluate which model is the most appropriate for our work, we compare the models’ performance on the test set and choose the best one. We penalise False Positives (FP) and False Negatives (FN) equally, hence the metric we choose to focus on is the F1 score (for the same reasons it was used as the optimiser). Other relevant metrics to support our choice are Balanced Accuracy (not Accuracy, as explained earlier), as well as the balance of Sensitivity (Recall) with Precision and potentially with Specificity.

We run the model predictions on the test data (after applying all the transformations tuned using the training data) and obtain the following results:

## get Logistic regression test metrics
lreg_pred = predict(lreg, stand_test_dat)
lreg_conf = as.data.frame(confusionMatrix(data = lreg_pred, reference = stand_test_dat$Treatment.response, positive = "Yes")$byClass)

# get Random forest Test metrics
rf_pred = predict(ranger_fit, stand_test_dat)
rf_conf=as.data.frame(confusionMatrix(data = rf_pred, reference = stand_test_dat$Treatment.response, positive = "Yes")$byClass)
# Join and display results in a table
results = round(cbind(lreg_conf,rf_conf),4)
colnames(results) = c("Logistic Regression", "Random Forest")
show_table(results, rownames = TRUE)
# Save for later 
bal_acc = round(results[rownames(results)=="Balanced Accuracy","Logistic Regression"]*100)

In this table, the Logistic Regression appears to be a slightly better candidate when looking at our main metric, the F1 score; hence, we select it as our final model.
However, it is interesting to note that the Logistic Regression performs much better in terms of Sensitivity, while the Random Forest model has far greater Specificity. It might be pertinent to switch models depending on which metric we wish to optimise.

Note: These predictive performances are not the best we can obtain with this data. However, model explainability is one of our objectives, hence a trade-off is made between predictive power and interpretability of features to find a good balance between the two.


Features Impact

Question 5. “Characterize the impact of features on the model predictions.”

To analyse the role that each feature plays in the Logistic Regression model, we can observe the input variables’ coefficients and their corresponding p-values. This interpretation is only possible because of our work to eliminate multicollinearity in the model (validated by the features’ low VIFs).

The Logistic Regression output is displayed below:

summary(lreg)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.5082  -0.8875  -0.2420   0.9495   2.2311  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6542     0.1944  -3.365 0.000764 ***
## Sex.female    1.2360     0.2485   4.974 6.55e-07 ***
## Weight       -1.0324     0.2438  -4.235 2.29e-05 ***
## Surface.cm2  -0.7904     0.2235  -3.536 0.000406 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 288.42  on 211  degrees of freedom
## Residual deviance: 215.72  on 208  degrees of freedom
## AIC: 223.72
## 
## Number of Fisher Scoring iterations: 5

According to this output, Sex.female, Weight and Surface.cm2 are all statistically significant at the 5% level, as all corresponding p-values are under 0.05, which allows us to reject the null hypothesis. They all contribute significantly to the predictions.

Based on the coefficients’ signs, we can conclude that, at the 5% level, Sex.female contributes positively to a patient responding to treatment, while Weight and Surface.cm2 contribute negatively to treatment response.
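As an aside, since these coefficients are on the log-odds scale of the standardised inputs, exponentiating them yields odds ratios, which are sometimes easier to communicate. A small sketch (keeping in mind that, because the inputs were centered and scaled, each ratio refers to a one-standard-deviation increase of the corresponding input):

## Odds ratios per one-standard-deviation increase of each (standardised) input
exp(coef(lreg$finalModel))
## e.g. exp(1.2360) is roughly 3.4 for Sex.female, and exp(-1.0324) is roughly 0.36 for Weight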

If the Random Forest were our final model, we could look at the Variable Importance of the different inputs. This measure, obtained here using a permutation technique (Gini impurity could have been used as well), allows us to assess the relative influence of each feature on the predictions, but it does not tell us whether a feature contributes positively or negatively to predicting a positive treatment response. That said, this can be inferred by looking at the table of individual logistic regression coefficients produced in the descriptive analysis.

Find the Variable Importance in the graph below:

p1= ggplot(varImp(ranger_fit)) +
  ggtitle("Variable Importance based on \n Random Forest Permutations Technique")+
  # labs( x=column, y="Frequency")+
  scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme(
    text = element_text(size=14,face="bold"),
    plot.title = element_text(hjust = 0.5,size=16),
    axis.text.x =element_text(hjust=1,vjust=0.9),
    plot.margin = unit(c(0.8,0.8,0.8,0.8), "cm"))

## Note: change color
p1[["layers"]][[1]][["aes_params"]]$fill = "dodgerblue4"
p1[["layers"]][[1]][["aes_params"]]$colour = "black"
p1[["layers"]][[1]][["aes_params"]]$size = 0.5
p1[["layers"]][[1]][["aes_params"]]$alpha = 0.8

p1

As you can see, the Sex and Surface.cm2 variables are preponderant in the treatment predictions. Weight’s importance is zero, meaning that, in theory, this feature could be removed from the model without affecting its predictive performance. This is surprising, and it might be worth spending more time on the Random Forest’s hyperparameter tuning.


Model Robustness

Question 6. “How could the robustness of the model be assessed and the individual predictions derived from it be justified?”

A model is considered to be robust if its output and forecasts are consistently accurate even when presented with new data.

One way of evaluating the final model’s robustness is to evaluate its predictive performance on new data. This is what we did by training the algorithm on the training set and evaluating its performance on the test set. For the Logistic Regression model, the predictive performance is identical between the cross-validation F1 Score (from training) and the test F1 Score (both F1 = 0.73), which is characteristic of a very robust model. We usually tolerate a small decrease in performance when evaluating on the test set; this is what we observe with the Random Forest, going from a cross-validation score of F1 = 0.78 in training to a test score of F1 = 0.71. There are several potential explanations. The first that comes to mind is the small sample size: the training set is small, hence unlikely to be an accurate representation of the true distribution, and it may generalise poorly to new data. However, this does not happen with the first model, which uses the same data sample, so we may instead be facing a slight overfit of the training data.
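
To quantify the uncertainty around this test performance (the test set is small), one simple check is to bootstrap the test set and look at the spread of the resulting F1 Scores; a minimal sketch, reusing lreg and stand_test_dat from above:

## Bootstrap the test set to get an uncertainty interval around the test F1 Score
set.seed(123)
boot_f1 <- replicate(1000, {
  idx <- sample(nrow(stand_test_dat), replace = TRUE)
  boot_dat <- stand_test_dat[idx, ]
  cm <- confusionMatrix(data = predict(lreg, boot_dat),
                        reference = boot_dat$Treatment.response,
                        positive = "Yes")
  unname(cm$byClass["F1"])
})
quantile(boot_f1, c(0.025, 0.975), na.rm = TRUE)  # 95% percentile interval for F1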

Note that the data in the test set comes from the same ‘ABC’ hospital as the training set. Model robustness can be challenged when predicting on new data from other hospitals, as the nature of some variables might change due to biases linked to the environment (air pollution in the area, local diet, but also factors such as variations in feature-recording techniques, etc.). For these reasons, model performance is expected to decrease when facing ‘truly’ new data.

Question 7. “Would you propose this model for use in clinical routine?”

This model is not ready for use in a clinical routine.

The final algorithm’s performance is good but not exceptional, with a balanced accuracy of 77%, which means that on new balanced data, the model would predict the wrong treatment outcome about 23% of the time. This is only 27 percentage points better than a random model predicting “Yes” and “No” alternately (which would give an accuracy of about 50% on balanced data).

Also, the model was trained on a small sample of data coming from a single hospital, and although it is robust when tested on ABC hospital data, it might generalise poorly. It would be necessary to collect more diverse data to train the model further before deploying it. Some other features might be relevant too (e.g. Smoker, Does.Sport, Height to help impute Weight, etc.).


MIXED MODEL SIMULATION

In the last section, we inferred that the model we produced might be less robust when handling data from other hospitals. As explained, this is suspected because of potential biases specific to each hospital’s environment, such as the general living conditions in the hospital’s region, but also the quality of treatment at the hospital, the way data is collected there, etc. We need a model to account for this.

Mixed Models are used when data has both fixed and random effects. Fixed effects are features we want to make inferences about, like the impact of Weight on Treatment response. Random effects account for variability we are not directly interested in, like variations between different hospitals in a medical study. Mixed Models help separate these effects, making statistical analysis more accurate by considering the unique characteristics of the data.

Let’s pretend that the sample data has another variable called Hospital.id, which represents the ID of the hospital where each patient’s data was collected and where they received their treatment.

We create this ID variable artificially (about 5 patients per hospital, i.e. around 60 hospitals).

The Mixed Model formula is as follows:

\[ \text{logit}(\text{P}(\text{Treatment.Response}_{ij} = 1)) = \beta_0 + \beta_1 \times \text{Weight}_{ij} + \beta_2 \times \text{Sex}_{ij} + \beta_3 \times \text{Surface}_{ij} + u_j \]

Where:

\[ \begin{align*} & \text{Treatment.Response}_{ij} & : & \text{ response variable for individual } i \text{ in hospital } j. \\ & \beta_0 & : & \text{ overall intercept.} \\ & \beta_1, \beta_2, \text{ and } \beta_3 & : & \text{ fixed effect coefficients for Weight, Sex, and Surface, respectively.} \\ & u_j & : & \text{ random intercept of hospital } j, \text{ normally distributed with mean } 0 \text{ and variance } \sigma_u^2. \end{align*} \]

We get the following results:

# Using the lme4 package
library(lme4)
## Create hospital variable randomly (avg 5 patients per hospital: 60 hospitals)
set.seed(123)
# Create a vector with 60 distinct random values
distinct_values <- sample(1:100, 60)
# Extend the vector to a length of 300 by repeating the distinct values, then shuffle
hosp_vector <-  sample(rep(distinct_values, each = 5))
## Add to data
stand_train_dat$Hospital.id = as.factor(hosp_vector[trainIndex])
stand_test_dat$Hospital.id = as.factor(hosp_vector[-trainIndex])

# Define the mixed-effects model formula
formula <- "Treatment.response ~ Weight + Sex.female + Surface.cm2 + (1 | Hospital.id)"

# Fit the mixed-effects model
mixed_model <- glmer(formula, data = stand_train_dat, family = binomial) #, control = lme4::glmerControl(optimizer = "bobyqa"))

# Extract variance components
# variance_components <- VarCorr(mixed_model)$Hospital.id
# # Calculate the ICC by dividing the between-group variance by the total variance:
# total_variance <- attr(VarCorr(mixed_model), "sc")^2
# between_group_variance <- sum(variance_components)^2 / length(variance_components)
# ICC <- between_group_variance / total_variance

# Summarize the model
summary(mixed_model)
## Generalized linear mixed model fit by maximum likelihood (Laplace
##   Approximation) [glmerMod]
##  Family: binomial  ( logit )
## Formula: Treatment.response ~ Weight + Sex.female + Surface.cm2 + (1 |  
##     Hospital.id)
##    Data: stand_train_dat
## 
##      AIC      BIC   logLik deviance df.resid 
##    225.3    242.1   -107.7    215.3      207 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.6779 -0.6466 -0.1663  0.7236  3.1107 
## 
## Random effects:
##  Groups      Name        Variance Std.Dev.
##  Hospital.id (Intercept) 0.199    0.446   
## Number of obs: 212, groups:  Hospital.id, 60
## 
## Fixed effects:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.6669     0.2071  -3.220 0.001284 ** 
## Weight       -1.0714     0.2616  -4.096 4.20e-05 ***
## Sex.female    1.2802     0.2647   4.836 1.32e-06 ***
## Surface.cm2  -0.8387     0.2467  -3.400 0.000675 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Correlation of Fixed Effects:
##             (Intr) Weight Sx.fml
## Weight       0.293              
## Sex.female  -0.448 -0.362       
## Surface.cm2  0.262  0.227 -0.252

As expected, the coefficients are very similar to those obtained with the logistic regression earlier. They could have been identical, but differences in the optimisation method lead to slightly different results.

In theory, we should observe reduced variance for these coefficients under the Mixed Model, and hence smaller p-values compared to the original model. However, this is not the case; we suspect this is because the random-effects variable Hospital.id was artificially produced and does not actually carry any meaningful random effect. There are ways to check this:

  • One is to run a Likelihood Ratio Test (LRT) to compare the models with and without the random effect. H0: there is no added value in modelling random effects. Results are below:
# ### Likelihood ratio test

# Null model is the simpler model (log_reg_model)
null_model <- lreg$finalModel
# Perform Likelihood Ratio Test
lrt_result <- lmtest::lrtest(null_model, mixed_model)
print(lrt_result)
## Likelihood ratio test
## 
## Model 1: .outcome ~ Sex.female + Weight + Surface.cm2
## Model 2: Treatment.response ~ Weight + Sex.female + Surface.cm2 + (1 | 
##     Hospital.id)
##   #Df  LogLik Df  Chisq Pr(>Chisq)
## 1   4 -107.86                     
## 2   5 -107.67  1 0.3868      0.534
## get pvalue and aic values for inline text
p_val = round(lrt_result$`Pr(>Chisq)`[2],4)

lreg_aic = round(lreg$finalModel$aic,1)
mixed_aic = round(summary(mixed_model)$AICtab[1],1)

The p-value obtained is 0.534, hence we cannot reject the null hypothesis. It seems that adding the random effects term does not significantly improve the model.

  • A second way to assess the usefulness of the random effect is to compare AICs between the mixed model and the original model. The mixed model’s AIC (225.3) is slightly higher than the logistic regression’s (223.7) -the smaller the AIC, the better-, meaning the original model has a slightly better fit. This backs our first finding: modelling the random effect does not seem necessary in this case (as we could expect).

CONCLUSION


Analysis Summary

Question 8. “Finally, summarize the results of the complete overall statistical analysis for a non-specialized public (clinicians).”

In this study, our goal was to build a statistical model to predict a patient’s response to cancer treatment Z three months after protocol initiation. From initiation day, we had demographic features such as the patient’s sex and weight, and radiomics features such as the lesion’s surface in cm2, its volume in cm3, statistical values about the grey-level intensity, etc. In the first step of this work, we ran a descriptive analysis to observe the individual distributions of these features and then looked at how they were distributed depending on the treatment response, as it was the variable to predict. When a variable’s distribution differs between patients who respond to the treatment and patients who do not, we can infer that this variable has a potential influence on the outcome; hence we retained these variables for inclusion in the model.

We then started “preprocessing” the data: We chose the most appropriate algorithm candidates for the work and applied a series of statistical methods to make sure the model performances would be good while keeping the interpretation possible to study which features affected the results and how. The candidates were then trained and compared, and the Logistic Regression model was selected as our final model.

In a new population of patients, half of which respond positively to the treatment, we estimated that our model could successfully predict 77% of treatment responses correctly. The main features that contribute to this prediction are the patient’s sex and their weight. Indeed, females are more likely to get a positive treatment response whereas weight affects this likelihood negatively.

Still, we concluded that this model is not ready for clinical routine. We cannot exclude that its performance would decrease when presented with patients from new hospitals, for example. More patient records from diverse hospitals are needed to improve the quality of its predictions.


Discussion

Question 9. “Discuss about possible improvements (notably if you had more time to perform the analysis).”

Some other sets of variables could be considered as inputs. If we accept to relax some of the constraints dictated by the need for feature interpretability, we could maximise predictive potential: a Random Forest is likely to perform better with more variables to choose from when building trees, and we did not give it a chance to select its own inputs based on variable importance. KNN would also become a viable candidate for comparison.
Moreover, we could explore resampling techniques to remedy the small sample size. Oversampling the minority class could also provide more observations, while allowing accuracy to be used as the optimisation measure once the data is artificially balanced.

That being said, past a certain threshold, the only solution to improve prediction performance on new data would be to access more of it:
In terms of variables, more grey-level related factors can be extracted from radiographic medical images than what is provided for this work. Our model barely uses the available radiomic features; one explanation could be that this information is better conveyed in other forms, and we would need access to these to find out. A simple example would be to use the 10th and 90th percentiles of grey intensity rather than the min and max intensities, as they would be more robust alternatives (see the sketch below).
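
As an illustration, if the raw voxel intensities of each ROI were available (they are not in this dataset), such robust percentile-based features could be derived as in the hypothetical sketch below:

## Hypothetical sketch: robust intensity features from raw ROI voxel intensities
## (raw intensities are not available in this dataset; for illustration only)
robust_intensity_feats <- function(voxel_intensities) {
  q <- quantile(voxel_intensities, probs = c(0.10, 0.90), na.rm = TRUE)
  c(P10.value.stat = unname(q[1]),  # robust alternative to Min.value.stat
    P90.value.stat = unname(q[2]))  # robust alternative to a max-based feature
}
# Example on simulated intensities:
robust_intensity_feats(rnorm(1000, mean = 50, sd = 10))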
Regarding the non-radiomic features, Weight looks very important when it is not missing. Its importance should be emphasised to clinicians to encourage a more systematic recording of its value. Also, several factors could explain the underlying relationship between weight and treatment response in more detail; hence, it would be useful to collect features such as smoking habits, cholesterol, daily physical activity, etc.
Finally, in terms of observations, more is better. With more patients to learn from, we can train better and more robust models, with higher scores that are less prone to performance deterioration when faced with new data. To make sure of the latter, we would ideally access data from several different hospitals to verify that the predictions generalise well. In that case, using Mixed Models would be a viable way to handle the random effects.


PART II: SURVIVAL ANALYSIS

Each patient was followed after treatment initiation until progression, death. Some patients were censored (e.g., lost of follow-up). Clinicians want to predict Progression-Free Survival (PFS) probability over time for each patient, based on the information collected at treatment initiation time.


DISCUSSION


Survival Analysis

Commonly applied in medical, biological, and social sciences, Survival Analysis (here applied to Progression-Free Survival) is a set of statistical methods that model the time until a patient experiences a particular outcome, such as disease progression or death. It also helps us assess factors influencing the occurrence of events over time.

Candidate models for survival analysis include the Cox proportional hazards model, the Weibull model, and the Kaplan-Meier estimator. The Cox model is widely used for its flexibility in assessing the impact of covariates on survival, while the Weibull model is suitable for capturing varying hazard rates. The Kaplan-Meier estimator is non-parametric and useful for estimating survival functions. These models are also designed to handle censored values, which is quite specific to this field.
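
For reference, the Kaplan-Meier estimator of the survival function is:

\[ \hat{S}(t) = \prod_{t_i \le t} \left(1 - \frac{d_i}{n_i}\right) \]

Where:

\[ \begin{align*} & t_i & : & \text{ distinct observed event times} \\ & d_i & : & \text{ number of events at time } t_i \\ & n_i & : & \text{ number of patients still at risk just before } t_i \end{align*} \]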


Cox Proportional Hazard Model

The Cox Proportional Hazard Model is widely used for analyzing the survival time data in the context of time-to-event outcomes. This semi-parametric model assesses the hazard rate9 of an event occurring over time while allowing for the inclusion of “covariates” (independent variables) to examine their influence on the hazard rate. This model assumes that the hazard ratio remains constant over time, offering insights into how covariates influence survival.

A Cox model supports estimation of relative differences in risk between patients with different characteristics, but since it does not estimate the baseline hazard function per se, it does not estimate absolute risks (event probabilities) or absolute differences in prognosis. By contrast, parametric models, although much less popular, are fully specified and therefore can estimate absolute risks and survival probabilities.

It can be expressed as follows:

\[ h(t | x) = h_0(t) \exp(\beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p) \]

Where:

\[ \begin{align*} & h(t | x) & : & \text{ hazard function at time } t \text{ given covariates } x \\ & h_0(t) & : & \text{ baseline hazard function} \\ & \beta_1, \beta_2, \ldots, \beta_p & : & \text{ coefficients for covariates } x_1, x_2, \ldots, x_p \end{align*} \]

The model also makes several assumptions:

  • Proportional Hazards Assumption: The hazard ratio is assumed to be constant over time, meaning that the relative risk of an event for one group compared to another remains constant.
  • Linearity in Log-Hazard: The log-hazard is assumed to have a linear relationship with the predictor variables.

‘Time’ and ‘Time to Event’ Variables

When training a Cox Proportional Hazards Model, the target response is made of two variables, usually obtained from patients’ follow-ups: Event and Time.to.event:

  • Event is a boolean that can be Progression_or_Death (TRUE) or Censored (FALSE)
  • Time.to.event is a numeric variable representing the number of days after treatment initiation at which the corresponding event (or censoring) occurs.

Unfortunately, these are not available in our dataset, hence we create them artificially based on the Treatment.response variable, which we assume would be correlated with disease progression in real-life data. This variable is then kept out of the model for the rest of the analysis.

Let’s assume that patients with a negative treatment response have a Progression or Death event on average sooner than patients who responded positively to treatment. We use a log-normal distribution to model the Time.to.event variable. Also, let’s assume there is more censoring in the responders’ group. Have a look at the code below for more details.

# Set the seed for reproducibility
set.seed(123)

# Parameters for the log-normal distribution
mean_days = c(200,400) ## mean days for non responders and responders
mean_log <- log(mean_days)
sd_log <- log(1 + 1)  # Standard deviation on the log scale: log(1 + sd_days/mean_days), with sd_days = mean_days

# Generate the vector of log-normal values
log_norm_no <- round(rlnorm( sum(cancer_dat$Treatment.response=="No"), meanlog = mean_log[1], sdlog = sd_log))
log_norm_yes <- round(rlnorm( sum(cancer_dat$Treatment.response=="Yes"), meanlog = mean_log[2], sdlog = sd_log))

## Add new variables to cancer data
cancer_dat = cancer_dat %>% mutate(
  Time.to.event = ifelse(Treatment.response=="No",log_norm_no,log_norm_yes ),
  Event = ifelse(Treatment.response == "Yes", 
                 rbinom(n(), size = 1, prob = 0.7) == 1, 
                 rbinom(n(), size = 1, prob = 0.95) == 1 
                        ) 
) 

To confirm these new variables are “realistic”, we observe the estimated Progression-Free Survival probability over time for each Treatment.response group using a Kaplan-Meier plot:

library(ggfortify) # For autoplot to recognise survival object
library(survival) # for Surv(), survfit()
library(survminer) ## For ggsurvplot

# Kaplan-Meier survival analysis
surv_object <- survfit(Surv(Time.to.event, Event) ~ Treatment.response, data = cancer_dat)
# Surv() returns Time.to.event with "+" appended in case of censoring
# It accepts TRUE/FALSE (TRUE = event, FALSE = censored), 1/0 (1 = event, 0 = censored), or 2/1 (2 = event, 1 = censored)
# survfit() creates the survival curves



### Plot Survival Free Prob visualisation
p = autoplot(surv_object) + 
 labs(x = "\nTime (Days) ", y = "Progress Free Probability \n", 
 title = "Progress Free Survival Times Of \n Cancer Patients \n(Kaplan-Meier plot)", colour = "Treatment Response", fill =  "Treatment Response") + 
  scale_fill_manual(values=c('hotpink3', 'dodgerblue4')) +
  scale_colour_manual(values=c('hotpink3', 'dodgerblue4')) +
  scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
  theme(
    legend.position = "top",
    # line = element_line(size = 4),  # Adjust the line thickness here
    text = element_text(size=14,face="bold"),
    plot.title = element_text(hjust = 0.5,size=16),
    axis.text.x =element_text(hjust=1,vjust=0.9),
    plot.margin = unit(c(0.8,0.8,0.8,0.8), "cm")) 
# Addition that cannot be directly implemented:
p$layers[[1]]$aes_params$size <- 1
p

Looking at the graph, you can see that the population responding to treatment has a higher progress free probability for the entire time period covered.

# Specify the time point for which you want to calculate the average probability
time_point <- 200

# Extract survival probabilities at the specified time point
surv_t_sum <- summary(surv_object, times = time_point)

# Calculate average progress-free probability for each treatment response
avg_t_srvl_no <- round(surv_t_sum$surv[1]*100)
avg_t_srvl_yes <- round(surv_t_sum$surv[2]*100)

Taking a specific example: looking at day 200 on the X-axis, the estimated probability of progression-free survival (no progression of the cancer and no death) within 200 days is 49% for a patient not responding to treatment, whereas it is 91% for a patient with a positive response.

We can also observe the median survival times for each group in the following summary:

med_values = surv_object %>% 
  gtsummary::tbl_survfit(
    probs = 0.5,
    label_header = "**Median survival (95% CI)**"
  )

## to get entire summary:
surv_object
## Call: survfit(formula = Surv(Time.to.event, Event) ~ Treatment.response, 
##     data = cancer_dat)
## 
##                          n events median 0.95LCL 0.95UCL
## Treatment.response=No  175    168    195     171     227
## Treatment.response=Yes 125     91    452     421     614

Median survival time is the time corresponding to a survival probability of 0.5. As expected, the median survival time is significantly higher for the treatment responder group.

In conclusion, the simulated Time.to.event and Event variables are working as expected.

# We then plot the equivalent Cumulative Incidence Probability:

# library(tidycmprsk)
# library(ggsurvfit)
# 
#  cuminc(Surv(Time.to.event, Event) ~ Treatment.response, data = cancer_dat)
#  
#  ggcuminc()


# library(cmprsk)
# # Create a Cumulative Incidence Plot
# cum_incidence <- cuminc(ftime = cancer_dat$Time.to.event,
#                         fstatus = cancer_dat$Event, 
#                         group = cancer_dat$Treatment.response)
# 
# # Convert the result to a data frame
# cum_incidence_df <- data.frame(time = cum_incidence$ftime, incidence = cum_incidence$est)
# 
# # Plot using ggplot2
# ggplot(cum_incidence_df, aes(x = time, y = incidence)) +
#   geom_step(size = 1) +
#   labs(title = "Cumulative Incidence Curve",
#        x = "Time",
#        y = "Cumulative Incidence Probability")


# The cumulative incidence curve illustrates the probability of a specific event (e.g., disease progression) occurring over time while accounting for competing risks (e.g., death). It shows the proportion of individuals experiencing the event of interest without being censored by other events. While Kaplan-Meier plots estimate the overall survival probability, cumulative incidence considers the occurrence of specific events and is particularly useful when there are competing risks in the study population. The two plots provide complementary perspectives on survival analysis, helping to distinguish between different outcomes over time.

HAZARD MODELING

In this analysis, we would like to see how patients’ features collected before treatment affect their progression-free survival times.

In order to do this, we will model the hazard rate using the Cox Proportional Hazard Model described above and use Time.to.event and Event variables as target response. Of course we still have to go through feature selection to find the best candidates for prediction.


Pre-Processing

To save time, we focus on the features that were pre-selected in the first phase of the analysis. In a real-life scenario we would run a complete bivariate analysis before excluding any variable. Also, we inherit the outlier, missing-value and standardisation treatments performed previously.

We divide the data between training and test (70/30, as before).

## Add new variables to training and test  data
stand_train_dat= stand_train_dat %>% mutate(
  Time.to.event = cancer_dat$Time.to.event[trainIndex],
  Event = cancer_dat$Event[trainIndex]
)
stand_test_dat= stand_test_dat %>% mutate(
  Time.to.event = cancer_dat$Time.to.event[-trainIndex],
  Event = cancer_dat$Event[-trainIndex]
)

Feature Selection with LASSO

In order to choose the most relevant covariates for our model, we perform feature selection for the Cox model using the LASSO regularisation technique with 10-fold cross-validation. It finds the optimal lambda, which maximises the model’s Harrell’s Concordance Index10. The corresponding model shows the following covariate coefficients:

# Prepare arguments
cox_feats = as.matrix(stand_train_dat%>% select(-c(Treatment.response, Time.to.event,Event, Hospital.id)))
cox_tar = Surv(stand_train_dat$Time.to.event, stand_train_dat$Event)

# Make reproducible
set.seed(3456)
# Run LASSO with 10 fold cv
lasso_cox <- cv.glmnet( 
  x = cox_feats,
  y = cox_tar, 
  family = "cox", type.measure = "C",  # Harrell's concordance measure to maximise
  alpha = 1, nfolds = 10
)


## Get C for optimal lambda
c_index <- max(lasso_cox$cvm)  # If maximizing C-index. cvm stands for "cross-validated mean."


# Extract coefficients
lasso_coefs <- coef(lasso_cox, s = "lambda.min")
lasso_coefs
## 11 x 1 sparse Matrix of class "dgCMatrix"
##                                   1
## Sex.female              -0.13987838
## Weight                   0.13821898
## Surface.cm2              0.01922020
## Volume.cm3               .         
## Number.of.grey.levels    .         
## Min.value.stat           .         
## Skewness.stat            .         
## Standard.deviation.stat  .         
## Homogeneity.GLCM         0.05110647
## Surface.volume.ratio     .         
## Compactness.1            .

We keep the features with non-zero coefficients; they will be used to train our final model. Unsurprisingly, Sex.female and Weight are part of the selection (they were important features for predicting Treatment.response, which we secretly know was used to build the Time.to.event and Event variables).


Model Fitting

We train a Cox PH Model to predict survival times based on the features selected using LASSO in the previous step and get the following results:

## Cox Regression model from Survival Package

# Get LASSO's selected features list
best_feats_names= rownames(as.matrix(lasso_coefs))[as.numeric(lasso_coefs)!=0]

# Write final formula for Cox model
best_feats_str= paste(best_feats_names, collapse = "+")
f2 = as.formula(paste("Surv(Time.to.event, Event)",best_feats_str, sep= "~"))

# Fit Cox model
cox_model = coxph(f2, data = stand_train_dat,
                  # method = "breslow", # for handing ties
                       model =TRUE # To return model in component "model"
                  )
summ = summary(cox_model)

## get weight exp(coef) for explanation
weight_exp = round(summ[["coefficients"]][rownames(summ[["coefficients"]])=="Weight","exp(coef)"],2)

# Get concordance
c_index =  round(summ$concordance[1],2)

# Display results
summ
## Call:
## coxph(formula = f2, data = stand_train_dat, model = TRUE)
## 
##   n= 212, number of events= 182 
## 
##                      coef exp(coef) se(coef)      z Pr(>|z|)   
## Sex.female       -0.22253   0.80049  0.07326 -3.038  0.00238 **
## Weight            0.20741   1.23049  0.06784  3.057  0.00223 **
## Surface.cm2       0.09054   1.09476  0.07493  1.208  0.22696   
## Homogeneity.GLCM  0.11546   1.12239  0.07467  1.546  0.12204   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##                  exp(coef) exp(-coef) lower .95 upper .95
## Sex.female          0.8005     1.2492    0.6934    0.9241
## Weight              1.2305     0.8127    1.0773    1.4055
## Surface.cm2         1.0948     0.9134    0.9452    1.2679
## Homogeneity.GLCM    1.1224     0.8910    0.9696    1.2993
## 
## Concordance= 0.611  (se = 0.022 )
## Likelihood ratio test= 20.72  on 4 df,   p=4e-04
## Wald test            = 22.76  on 4 df,   p=1e-04
## Score (logrank) test = 22.98  on 4 df,   p=1e-04

The Wald tests associated with each covariate’s coefficient suggest that Weight and Sex.female are the only significant terms at the 5% level after accounting for the other predictors in the model. That being said, the remaining coefficients still display relatively small p-values.

The hazard ratios given by the exp(coef) column are measures of effect size. For example, the hazard ratio associated with a one-unit increase in Weight is 1.23; since the variable was standardised, one unit corresponds to one standard deviation rather than 1 kg, so each additional standard deviation of weight at diagnosis is associated with a 1.23-fold increase in the hazard of progression.
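
To make these effect sizes easier to read at a glance, the hazard ratios and their confidence intervals can be displayed as a forest plot; a minimal sketch using survminer (already loaded above), with an illustrative title of our own:

## Forest plot of hazard ratios with 95% confidence intervals for the fitted Cox model
survminer::ggforest(cox_model, data = stand_train_dat,
                    main = "Hazard ratios (per 1 SD increase for continuous covariates)")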

Harrell’s concordance index is a measure of the discriminatory power of a survival model. It ranges from 0 to 1, where 0.5 indicates random chance (no predictive ability). Values between 0.5 and 0.7 usually indicate poor to moderate discrimination, but values of 0.6 to 0.7 are common with survival data. Here, the concordance on the training set is 0.61, which is a fairly typical value.
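
Intuitively, it is the proportion of comparable patient pairs (pairs whose ordering of event times can be established despite censoring) in which the patient with the higher predicted risk experiences the event first:

\[ C = P\left(\hat{\eta}_i > \hat{\eta}_j \mid T_i < T_j\right) \]

Where:

\[ \begin{align*} & \hat{\eta}_i & : & \text{ predicted risk score (linear predictor) for patient } i \\ & T_i & : & \text{ event time for patient } i \end{align*} \]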

The likelihood ratio test at the bottom of the output compares this model to a null model with no covariates (all patients share the same baseline hazard). The small p-value indicates that this model is significantly better than the null model. We run an ANOVA (technically an analysis of deviance) to get more detail on the likelihood ratio test:

# Perform likelihood ratio test
lr_test <- anova(cox_model, test = "Chisq")

# Print results
print(lr_test)
## Analysis of Deviance Table
##  Cox model: response is Surv(Time.to.event, Event)
## Terms added sequentially (first to last)
## 
##                   loglik  Chisq Df Pr(>|Chi|)   
## NULL             -806.48                        
## Sex.female       -802.71 7.5431  1   0.006024 **
## Weight           -798.46 8.4864  1   0.003578 **
## Surface.cm2      -797.28 2.3642  1   0.124151   
## Homogeneity.GLCM -796.12 2.3280  1   0.127065   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output shows the results of sequential likelihood ratio tests: each row compares the model including that term to the model containing only the terms from the previous rows.

In our results, Sex.female and Weight have low p-values, suggesting that including these variables significantly improves the model fit.

Taken in this order, neither of the last two variables appears to add significant predictive value (at the 5% level) once the preceding variables have been accounted for. That being said, once more the p-values are relatively small.


Model Evaluation

In this step we go beyond the default model performance metrics provided by the training output. We test whether the proportional hazards assumption holds, before assessing model Discrimination and Calibration on unseen data (the test set).

“Successful validation of a model means achieving satisfactory discrimination and calibration (prediction accuracy) in the validation sample. Validating Cox models is not straightforward because event probabilities are estimated relative to an unspecified baseline function. (…) The methodology for validation is not particularly well worked-out for models of time to event data. The main reason why time to event data provide a greater challenge is the almost invariable censoring of some observation times, caused by some patients’ outcomes remaining unascertained within the study period.” - from Patrick Royston & Douglas G Altman’s research paper


Testing Proportional Hazards Assumption

As mentioned earlier, the Cox Model relies on a couple of assumptions. It is important to examine the Schoenfeld residuals to ensure the proportional hazards assumption is verified. If a violation is suspected, a simple Cox Model is not reliable and other models should be considered.

# Test proportional hazards assumption
cox_zph <- cox.zph(cox_model)

# Plot Schoenfeld residuals against time
res_plot = ggcoxzph(cox_zph,
         #var = "Weight" ##choose variable here (put in loop)
         ) 

# Print results
print(cox_zph)
##                    chisq df     p
## Sex.female       0.27381  1 0.601
## Weight           0.00688  1 0.934
## Surface.cm2      3.78734  1 0.052
## Homogeneity.GLCM 2.26509  1 0.132
## GLOBAL           5.11041  4 0.276

A high p-value (greater than the chosen significance level, e.g., 5%) suggests that the proportional hazards assumption is not violated for that covariate. In our results, none of the p-values are below 0.05, indicating no strong evidence against the proportional hazards assumption.

We can observe the Schoenfeld residuals obtained for each variable in the scatter plots below:

# Initialise list object
gg0 = list()

# Start loop through variables
for( i in 1:length(best_feats_names)){
  # Get variable name
  feat_name = best_feats_names[i]
  
  # improve residual plot for the variable
  p= res_plot[[i]] +
    scale_x_continuous(breaks = scales::pretty_breaks(n = 10)) +
    scale_y_continuous(breaks = scales::pretty_breaks(n = 10)) +
    theme_gray() +
    theme(
      text = element_text(size=14,face="bold"),
      plot.title = element_text(hjust = 0.5,size=16),
      axis.text.x =element_text(hjust=1,vjust=0.9),
      plot.margin = unit(c(0.8,0.8,0.8,0.8), "cm"))
  
  p[["layers"]][[2]][["aes_params"]][["colour"]]= "dodgerblue4"
  p[["layers"]][[2]][["aes_params"]][["size"]] = 1.3
  p[["layers"]][[1]][["aes_params"]][["size"]] = 1.3
  
  ## Assign ggplot object
  gg0[[feat_name]] = p
  
}




# Display variable plots in different tabs
for (Var in best_feats_names) {
  cat("##### ",Var,"\n")
  print(gg0[[Var]])
  cat('\n\n')
}
[Schoenfeld residual plots displayed in tabs, one per covariate: Sex.female, Weight, Surface.cm2, Homogeneity.GLCM]

We’re looking for the smooth curve to be fairly level across the time horizon here, as opposed to substantially increasing or decreasing in level as time passes. It seems to be the case for every plot, which agrees with the test results.


Discrimination

“Discrimination, sometimes known as ‘separation’, is the extent to which risk estimates from a model characterise different patient prognoses. Patients predicted to be at higher risk should exhibit higher event rates than those deemed at lower risk.” - from Patrick Royston & Douglas G Altman’s research paper

We assess the model’s ability to discriminate between different risk groups using the concordance index (C-index) on test data. This metric was provided for the training set during modelling, but evaluating it on the test set allows us to assess whether the Cox Model generalises well to unseen data.

### Evaluate discrimination on the test data
temp <- survfit(cox_model, newdata = stand_test_dat)
pred_test <- summary(temp)$table[,"median"] ## median survival times predicted for each patient

# Compute the C-index on the test set
SurvMetrics::Cindex(
  object = Surv(stand_test_dat$Time.to.event,stand_test_dat$Event), ## surv() object for test values
  predicted = pred_test ## prediction times for test values
  , t_star = -1)
##  C index 
## 0.628924

Here, the C index obtained on test data is similar to the one we got on the training set, which means the model generalises well. Again a result between 0.6 and 0.7 is not exceptional, but still a standard performance for a Cox model.


Calibration

Calibration refers to how well the predicted probabilities align with the actual outcomes. To evaluate model calibration, we can use calibration plots or metrics that measure the agreement between predicted and observed event probabilities. A commonly used test statistic is the Hosmer-Lemeshow (HL) test, which compares observed and expected events across risk groups.

Here, we apply a Hosmer-Lemeshow-type test on the training set (Note: ideally this should be run on the test set, but we could not find the proper code design in time). To do so, we divide the sample into 5 strata of similar size based on risk scores and test for differences between observed and expected events in each group. The overall statistic follows a chi-squared distribution.

#### Calibration test based on the training set

# Divide the population into G equal-sized groups based on the prognostic index (PI) distribution
survMisc::gof(cox_model, G=5)
## $groups
##                 n  e      exp          z         p
## 1: <multi-column> 30 33.22061 -0.5587715 0.5763177
## 2: <multi-column> 33 37.56872 -0.7453868 0.4560379
## 3: <multi-column> 39 35.69896  0.5524888 0.5806135
## 4: <multi-column> 39 36.06183  0.4892747 0.6246472
## 5: <multi-column> 41 39.44988  0.2467987 0.8050640
## 
## $lrTest
## Analysis of Deviance Table
##  Cox model: response is  Surv(Time.to.event, Event)
##  Model 1: ~ Sex.female + Weight + Surface.cm2 + Homogeneity.GLCM
##  Model 2: ~ Sex.female + Weight + Surface.cm2 + Homogeneity.GLCM + indicG
##    loglik  Chisq Df Pr(>|Chi|)
## 1 -796.12                     
## 2 -793.08 6.0817  4     0.1931
# We compare the number of events observed (e) and expected (exp) in each group
# High p-values: we cannot reject H0 that observed and expected events follow the same distribution

We observe that none of the individual group Z statistics show significant differences (all p-values are above 5%), and neither does the overall chi-squared test for the model. We conclude that the model is well calibrated.
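
Since the test above was computed on the training set, one rough way to check calibration on unseen data would be to compare observed and predicted progression-free survival at a fixed time point within risk groups of the test set. This is only a sketch (not a formal test), reusing cox_model and the test split from above:

## Rough test-set calibration check: observed (KM) vs predicted (Cox) survival at day 200,
## within risk tertiles defined by the Cox linear predictor
lp_test  <- predict(cox_model, newdata = stand_test_dat, type = "lp")
test_cal <- cbind(stand_test_dat,
                  risk_grp = cut(lp_test,
                                 breaks = quantile(lp_test, probs = c(0, 1/3, 2/3, 1)),
                                 include.lowest = TRUE, labels = c("Low", "Medium", "High")))

pred_200 <- as.numeric(summary(survfit(cox_model, newdata = test_cal), times = 200)$surv)
obs_200  <- summary(survfit(Surv(Time.to.event, Event) ~ risk_grp, data = test_cal),
                    times = 200, extend = TRUE)$surv

data.frame(risk_group = c("Low", "Medium", "High"),
           observed   = round(obs_200, 2),
           predicted  = round(as.numeric(tapply(pred_200, test_cal$risk_grp, mean)), 2))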


Discussion

If the examination of the Schoenfeld residuals had revealed that the proportional hazards assumption was violated, it would have meant that a simple Cox Model was not appropriate for this work due to the presence of time-varying effects. In that case, we could use a Weibull model to capture time-varying effects, stratification, or other techniques.
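
For illustration, a fully parametric alternative could be fitted on the same covariates; a minimal sketch, reusing the LASSO-selected features and the training data prepared above:

## Parametric Weibull (AFT) model as an alternative to the Cox model
weibull_model <- survreg(Surv(Time.to.event, Event) ~ Sex.female + Weight +
                           Surface.cm2 + Homogeneity.GLCM,
                         data = stand_train_dat, dist = "weibull")
summary(weibull_model)
## Being fully parametric, it can predict absolute quantities, e.g. each test
## patient's predicted median progression-free survival time:
# predict(weibull_model, newdata = stand_test_dat, type = "quantile", p = 0.5)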

Next, we could run a sensitivity analysis to explore the impact of potential violations of assumptions or unmeasured confounding.

We could also train a Shared Frailty Model to take the random effects of Hospital IDs into account.
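
A minimal sketch of such a model, assuming the simulated Hospital.id variable from Part I is still attached to stand_train_dat:

## Cox model with a shared (gamma) frailty term per hospital
frailty_cox <- coxph(Surv(Time.to.event, Event) ~ Sex.female + Weight +
                       Surface.cm2 + Homogeneity.GLCM + frailty(Hospital.id),
                     data = stand_train_dat)
summary(frailty_cox)
## The estimated frailty variance indicates how much unexplained between-hospital
## heterogeneity there is in the baseline hazard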


Foot Notes


  1. Also called “Target”, “Target Variable” and “Dependent Variable” in this work↩︎

  2. We assume that each of the 300 patients presents only one malignant lesion, whose radiomic features are recorded in the dataset↩︎

  3. Variables that are associated with the future course or outcome of a medical condition, disease, or treatment. Also called “Features”, “Factors”, “Explanatory Variables” and “Independent Variables” in this work↩︎

  4. Quantitative and high-dimensional image features extracted from medical imaging data. Also called “Biomarkers” in this work↩︎

  5. RECIST: a set of published rules that define when tumors in cancer patients improve (“respond”), stay the same (“stabilize”), or worsen (“progress”) during treatment.↩︎

  6. Winsorisation: Caps the extreme values to a predetermined threshold. This threshold should be determined using training data only to avoid model contamination. However, in this study the dataset is small, we want to use as much information as we can get, and the impact of model contamination is negligible. Hence we compute the threshold using the entire dataset’s percentile values.↩︎

  7. Confounding variables are factors that can influence both the predictor variables and the outcome variable (treatment response)↩︎

  8. Standardisation: Process of removing mean from all observations and dividing by standard deviation to center and scale variable↩︎

  9. Hazard: In survival analysis, it represents the instantaneous rate of occurrence of an event (such as death, failure, or occurrence of a specific outcome) at a given point in time, given that the individual has survived up to that time.↩︎

  10. Harrell’s Concordance Index is a measure of the discriminatory power of a survival model↩︎