Data-Driven Statistical Modeling and Analysis of the Survival Times of Multiple Myeloma Cancer

Multiple Myeloma (MM) has been and continues to be the subject of many research studies. The main goal is to improve the therapeutic/treatment process of survival of MM patients. Based on the 2012-2016 MM cases and deaths, the number of new cases was 6.9 per 100,000 men and women per year, and the number of deaths was 3.3 per 100,000 men and women per year. It is therefore imperative to research into MM. In the present study, we proposed a data-driven statistical model for the survival time of 48 patients diagnosed with multiple myeloma as a function of 16 attributable risk factors. We identified 9 attributable risk factors out of 16 and one interaction term to be significantly contributing to the survival time. They are Bence Jone protein in urine, blood urea nitrogen (BUN)/serum creatinine, infections, % myeloid cells in peripheral blood, fractures, serum calcium, gender, platelets and age, and white blood cells & total serum protein an interaction term. The proposed model satisfied all the model assumptions, passes the residual analysis test and has very high prediction accuracy. Thus, it passes the goodness-of-fit test and the qualities of a good model. The identified significant attributable risk factors and the interaction has been ranked based on the percent contribution to the survival time. The proposed model was evaluated and compared with other existing models of survival of multiple myeloma. Our model is very accurate and also identifies some new significant risk factors. The study offers an improved strategy for the therapeutic/treatment process of multiple Myeloma Cancer.


Introduction
Multiple Myeloma (MM), also known as Kahler disease, myelomatosis, and plasma cell myeloma is a type of cancer that starts from a malignant plasma cell (specifically the white blood cell) [1]. In the human body, the plasma cell produces antibodies as part of the human immune system that helps fight against germs and harmful substances. Myeloma is caused by the plasma cell becoming abnormal called the myeloma cell. With time, the myeloma cell accumulates in the bone marrow, where they crowd out healthy blood cells and may cause damage to the solid part of the bone. Multiple myeloma is therefore caused by the accumulation of the myeloma cells in the bones [2][3][4][5][6][7][8]. Figure 1 below shows the development of the myeloma cells [4,[5][6][7][8][9][10][11].
The abnormal plasma cells produce abnormal antibodies, which can cause kidney problems and overly thick blood [12]. The initial identified risk factors of MM include obesity, radiation exposure, family history, and certain chemicals [13][14][15][16]. Some recommended treatment for multiple myeloma is focused on decreasing the clonal plasma cell population and consequently decrease the symptoms of disease [17]. For patients under the age of 65, the preferred treatment is high-dose of chemotherapy, commonly with bortezomib-based regimens, and lenalidomidedexamethasone followed by autologous hematopoietic stem-cell transplantation (ASCT), that is, trans-plantation of a person's own stem cells [18]. In 2017, a meta-analysis performed showed that post-ASCT maintenance therapy with lenalidomide enhanced progression-free survival and overall survival in persons at standard risk [19]. Whereas, in 2012, it was found from clinical trials that intermediate and high-risk disease patients benefit from a bortezomib-based maintenance regimen [20]. Health Science Journal ISSN 1791-809X the U.S. [21][22][23][24]. The Surveillance, Epidemiology and End Results (SEER) Cancer Institute reported in 2019 that MM constitutes 1.8% of all new cancer cases in the U.S. and ranked 14 among the list of cancer diseases [2]. They further projected that there will be an estimated number of 32,110 new cases of MM and an estimated 12,960 people will die of this disease. Those figures are staggering and overwhelming and cannot be overlooked. This is a rise compared with the estimated number of new MM cases of 24,050 reported in 2014 [3,6]. The established risk factors of MM are common among the age, black race, families with MM history, and being a male [2,7]. Reported earlier [2], 63.1% of all races and sexes of MM cases from 2012-2016 are aged 65 or greater.
Evidence about the risk factors or what typically causes MM remains scant. The existence of the myeloma plasma cell has not been quantified to be able to assess the contributing risk factors of MM. However, several risk factors have been identified to have some relation with the survival of patients with MM [9,10]. Most of these factors were identified at the time a patient was diagnosed with MM. Earlier studies [10,22,23]  In the present study, we developed a real data-driven statistical model of the significant attributable risk factors of survival time of patients diagnosed with multiple myeloma. The clinical trial that was conducted consisted of 65 patients who were diagnosed with MM. However, our study concentrated on 48 of the patients that we have death times (survival times) from diagnosis. The remaining 17 patients, we did not have information about their death times, so they were excluded from our analysis and modeling. Because of the low amount of the data, we did the modeling utilizing the 48 pieces of information. The data was filtered to fulfill all the modeling assumptions. After the development of the statistical model, we used the bootstrapping, resampling method to increase the amount of information, and then improved the accuracy of our statistical model. We identified the significant risk factors, and interaction contributing to the survival time of MM. The significant risk factors including the interaction identified were ranked based on the percentage of contribution to the death of MM patients, using the coefficient of determination (R 2 ) of the survival times. The quality and accuracy of the proposed model was assessed based on the R 2 along with the R 2 adjusted statistic, the Akaike information criterion (AIC) of model selection, the prediction error sum of squares (PRESS), the root mean square error (RMSE), the variance inflation factor (VIF), the residual analysis, and the prediction accuracy (the correlation of the actual and predicted survival times based on 80% training set and 20% testing set).

Method Data description
The data used in this research is from West Virginia University Medical Center provided by Harley [25,26]. Originally, the data constituted survival times of 72 multiple myelomas (MM) patients diagnosed and treated with alkylating agents [25]. 65 out of 72 patients provided complete data for 16 concomitant variables (risk factor) whiles the remaining 7 were eliminated due to missing data in at least one of the 16 risk factors. Given that a patient is diagnosed with myeloma, the 16 risk factors were recorded and the time up to which the patient survived the disease was also recorded (called the survival time from diagnosis to the nearest month). Of the 65 patients, 48 and 17 were dead and alive, respectively. In the present research, we utilized the complete data of the 48 patients death times for our analysis and modeling. The survival time of patients is the response variable with the information of the 16 the risk factors listed below. Thus, we have one continuous response variable, 11 continuous risk factors, and 5 categorical risk factors. The detailed description of the response variable and the 16 risk factors are given in Table  1 below.
Before we proceeded to develop the statistical modeling of the survival times of patients diagnosed with multiple myeloma, we wanted to know whether there is a difference in the survival times with respect to gender, i.e. Male and Female. Given that we have a small data of only 48 patients, we used the log-rank test [27,28] from Kaplan-Meier non-parametric test and compare the differences in survival times of male and female. From Figure  2 below, the log-rank test resulted in a large p-value=0.45, Figure 2 Log-Rank test for Difference in Survival Time of Gender.

Symbol Variable Name t
Survival time from diagnosis to nearest month +1 X 1 Log blood urea nitrogen (BUN)/serum creatinine at diagnosis X 2 Hemoglobin at diagnosis X 3 Platelets at diagnosis 0 abnormal, 1 normal X 4 Infections at diagnosis 0 none, 1 present X 5 Age at diagnosis (complete years) X 6 Gender 1 male, 2 female X 7 Log white blood cell (WBC) at diagnosis X 8 Fractures at diagnosis 0 none, 1 present X 9 Log %BM at diagnosis (log % plasma cells in bone marrow) X 10 % Lymphocytes in peripheral blood at diagnosis X 11 % Myeloid cells in peripheral blood at diagnosis X 12 Proteinuria at diagnosis X 13 Bence Jone protein in urine at diagnosis 1 present, 2 none X 14 Total serum protein at diagnosis X 15 Serum globin (gm%) at diagnosis X 16 Serum calcium (mgm%) at diagnosis with zero mean and standard deviation of one, ε ∼ N (0,1) as n → ∞. This was tested using the normal probability Q-Q plot and was verified using a formal test of normality i.e. the Shapiro Wilk's test with null hypothesis H 0 , that the residuals errors follow the normal probability distribution.

Homoscedasticity
The residual errors should have constant variance, var (ε i )=σ 2 . We verify this by observing the plot of residuals versus fitted values; no pattern implies errors have constant variance. We then supported it with a formal test of non-constant variance with the null hypothesis H 0 , that the variance of the errors is constant.

None or very minimum multicollinearity
The risk factors should not be highly correlated. Usually, a correlation coefficient of r ≥ 0.9 indicates a very high correlation. A formal test for multicollinearity is using the variance inflation factor ( ) ; VIF > 10 implies the presence of multicollinearity.

No autocorrelation
Residual errors are independent and uncorrelated, ε i ∼ i.i.d / N (0, σ 2 ). We checked this using a formal test of autocorrelation, i.e. Durbin Watson test with null hypothesis H 0 , that there is no autocorrelation.
We started by visually inspecting the matrix of scatter plot to assess the linear relationship between the response variable t and the continuous risk factor X i . As shown in Figure 3, there is a weak linear relationship between the response variable t, and all the continuous risk factors given that the highest correlation coefficient is r=0.31, which is with X 1 . The distribution of the indicating a failure to reject the null hypothesis that there is no difference with respect to gender. This is a good justification to proceed with the building of the statistical model for the survival time of patients with MM since there is no bias concerning gender.

Statistical modeling
We develop a statistical model for the survival times (death times) of the 48 patients diagnosed and died of multiple myeloma. In the building of the statistical model for multivariate linear regression, the following assumptions must be satisfied:

Linearity
There should be a linear relationship between the response variable t (survival time) and the risk factors including interactions. This is expressed as where the response variable t i =(t 1 ,...,t n ) T , α=(1,...,1) T is the model intercept parameter, β=(β 1 ,...,β k ) T is the coefficient parameter of the attributable risk factors X i 's, ρ ij is the coefficient parameter of interaction between i th and j th attributable risk factors, εi=(ε 1 ,...,ε n ) T represents the model residual error term, and k=16 and n=48 is the number of attributable risk factors and the sample size, respectively. Linearity was assessed using the matrix of scatter plots and correlation between the response and the continuous risk factors.

Multivariate normality
The errors should follow Gaussian normal probability distribution survival times t is right-skewed as it follows the three-parameter log-normal probability distribution (from the parametric analysis). We can see that some of the risk factors have skewed shaped distributions (a possible influence of outliers or extreme values). However, we continued to fit a model of the response variable as a function of the 16 attributable risk factors resulting in a coefficient of determination (R 2 )of 0.48 (48%); this cannot be considered a good model given that there are discrepancies associated with the data such as skewness or kurtosis [29][30][31][32][33]. However, fitting the first model to the original data allow us to check for other model assumptions.
In Figure 4, we plotted the Q-Q plot of residuals of the model built from the original data to assess the multivariate normal probability distribution. There is evidence of violation from normality as shown by the skewed ends of the Q-Q plot. A formal test for normal distribution using the Shapiro Wilk's normality test resulted in a p-value=8.632e −03 , which is an indication of a lack of the normal probability distribution. This implies that the survival time t of patients with multiple myeloma does not follow the Gaussian probability distribution.
Given that there is a weak linear relationship and no multivariate normality, we applied log transformation to the response variable t and the skewed risk factors X 10 , X 12 , X 14 and X 16 . Log transformation stabilizes the variance and suppresses the impact of outliers or extreme values in the data [29]. The transformations are giving by the expressions below: and Where ' i X denotes the transformed risk factor of X i . After the variable transformations, we proceeded to fit the full model for the survival times t as a function of the 16 risk factors and all two-way interactions between them. We then utilized the backward elimination procedure for model selection to find the attributable risk factors and the interaction(s) that significantly contributes to the survival time t. The backward elimination model selection technique is often used because it provides less bias mean square error (MSE) values and turns to prevents overfitting of the model, which is essential for the prediction performance of the model. Using this method of model selection, we selected the best model with the least Akaike information criterion (AIC=2ln(L) + 2k, where L is the value of the maximum likelihood function of the model and k represents the estimated model parameters) [27]. AIC gives an estimation of the relative amount of information missing in the model; hence, the smaller the AIC value the better the quality of the model. Therefore, given the model selection method and criterion of choice of a good model, the best-proposed model with R 2 =0.8741 which includes all the attributable risk factors and interaction that significantly contributes to the survival time of patients with multiple myeloma is given by ( ) Thus, there are nine attributable risk factors, namely, Bence Jone protein in urine, blood urea nitrogen (BUN)/serum creatinine, infections, % myeloid cells in peripheral blood, fractures, serum calcium, gender, platelets and age, and one interaction term, namely, white blood cells and total serum protein that significantly contribute to the survival of MM patients. The following remaining five risk factors do not contribute to the survival time of MM patients at diagnostic: hemoglobin, plasma cells in bone marrow, lymphocytes in peripheral blood, proteinuria and serum-globin (gm%). Because the estimated survival time t' and the attributable risk factors ' i X from equation (3) are based on the log-transform data from equation (2), we utilized the anti-logarithmic to transform back to the original To use the above proposed model given by equation (3), we first take the anti-logarithmic of the log transform attributable risk factors into the original values, given in equation (4). We then take the anti-logarithmic of the entire model in equation (3)  . The R 2 is generally used to measure the goodness-of-fit of a statistical model. It estimates the proportion of variation in the response variable explained by the model attributable risk factors [30,31]. The higher the R 2 statistic the better the goodness-of-fit of a statistical model. In general, the R 2 is defined by   (4). SS reg is the regression sum of squares representing the variation explained by the proposed model, SS res is the residual sum of squares representing the variation in the proposed model left unexplained and SS tot is called the total sum of squares is proportional to the sample variance, and equals to the sum of SS reg and SS res . Generally, the R 2 has the problem of increasing by increasing the number of parameters or predictors in the model. Therefore, it is recommended that we estimate the R 2 along with the

Bootstrapping with the proposed statistical regression model
To further improve the efficiency of the proposed statistical model, we utilized the bootstrapping resampling method that due to Efron (1979). Bootstrapping is a general approach to statistical inference that allows assigning measures of accuracy (defined in terms of bias, variance, confidence intervals, prediction error or some other such measure) to sample estimates based on building a sampling distribution for a statistic by resampling from the actual data that we analyzed in the present study [34]. We applied the bootstrap sampling to resampled with replacement the data used to build the proposed analytical model given by equation (3); increasing the sample size by 300. This asymptotically increased the level of significance of the coefficient estimates, making them equally highly significant, and increased both the R 2 and 2 adjusted R to 91.16% and 90.85%, respectively. The modified version of the model in equation (3)

Validation of the Proposed Statistical Model
Before validating the proposed model, we need to be sure that all assumptions that underline our proposed model are satisfied. We tested for linearity by showing the linearity plot (sometimes referred to the partial residual plot) of the response variable and the significant attributable risk factors as shown in Figure 5, below. We can see that there is a well-established linear relationship between the response variable and the continuous attributable risk factors (shown by the blue and pink lines). Therefore, the linearity assumption which was initially a problem we encountered has been rectified in our final proposed statistical model.
To verify that the proposed statistical model satisfies multivariate normal probability distribution assumption, we used the normal Q-Q plot shown in Figure 6. We see that the residuals are normally distributed with no major outlier and all the points in the plot fall within the 95% confidence bound. The evidence of normality is supported by the Shapiro Wilk's test of the normal probability distribution (a formal test), given by a high p-value of 0.818. The plot of the distribution of studentized residuals in the second panel of Figure 6, is further evidence that the proposed model's normality assumption is valid.
We performed a residual analysis to assess the model residuals and constant variance. Figure 7, depicts the residual plot of the proposed model. Thus, we can conclude that there is no problem of homoscedasticity.
Our proposed statistical model perfectly satisfies the assumption of constant variance, indicated by the randomly scattered points about the zero line with no major outliers. A formal test for homoscedasticity revealed a p-value of 0.506, which strongly supports that the homoscedasticity of our proposed model is valid.
The mean absolute value of the residuals, Multicollinearity is a major problem in statistical modeling which must be addressed. It can distort the precision of the estimated coefficients leading to overfitting and misinterpretation on the results of the model. All the estimates of the parameters in our proposed model have a very small variance inflation factor VIF < 3 indicating that there is no problem of multicollinearity. Also, we expect the model residuals to be independent and uncorrelated. We tested for the presence of autocorrelation among errors in the proposed model using the Durbin Watson of testing the null hypothesis H 0 , no autocorrelation is present. Accepting the hypothesis with a large p-value of 0.624 indicated that there is no autocorrelation among residuals in our proposed model.
To validate the prediction accuracy of our proposed statistical model, we trained 80% of the data to build our model and tested on the remaining 20% test data. The prediction of the original  model and the trained model using the test data is given in Table  2. We checked the accuracy of the predictions by finding the correlation coefficient r and the corresponding R 2 (square of r) between the actual and the predicted values. This resulted in R 2 of 0.943, a very high prediction accuracy. The comparison of the logarithmic survival times with the two models (i.e. model developed using all the 48 patients and the 80% trained model) prediction on the test data resulted in R 2 of 0.943 and 0.930, respectively, attesting to the high prediction accuracy of our proposed model.

Ranking of the contribution of attributes/ risk factors of the survival times of multiple myeloma
In this section, we rank the individual significant risk factors and the interaction based on their contribution to the survival time of MM patients using the percentage of R 2 . Table 3 shows the rank of each of the identified significant risk factors and the interaction term. Bence Jone protein in urine is ranked first, followed by blood urea nitrogen (BUN), the interaction term is ranked eighth, and age has the least contribution to the survival time of patients diagnosed with multiple myeloma (MM) among the significant attributable risk factors. A detailed discussion of the rankings will continue in the next section.

Discussion
The evaluation of the survival time of patients diagnosed with MM is an essential prerequisite for improving the prognosis and therapeutic/treatment strategy of multiple myeloma. The present study was designed to find a real data driven statistical model that accurately predicts the survival time from diagnosis to the nearest month of multiple myeloma patients deaths. In the present study we accomplished the following: • We identified the significant attributable risk factors.
• We identified the significant interactions among the risk factors.
• We determined the percentage of contributions of each identified risk factor and interaction that causes the death of the patients.
It was important to assess whether there is a difference in the survival times with gender in which we found no difference, a good characterization for our data analysis of the development of our model. We started building the statistical model with 16 predictors (risk factors) reported to be contributing to the survival of MM but we only found nine (9) individually contributing factors along with a single interaction. Most of the risk factors in our data have been reported to be important by several researchers [9,22,32,[35][36][37][38][39], however, we did not find all of them to be important. The final proposed model that accurately predicts the survival time is given by equation (7), in a transformed form. We proceed to take the anti-logarithm of the transformed model to get the original values of the survival time utilizing equation (4). The goodness-of-fit of the model was very carefully evaluated as follows: (1) the model satisfies all the (1-5) assumptions of a good statistical regression model as we described it in section   The justification of the usefulness/relevance of the proposed statistical model compared to other existing models or findings was assessed and evaluated. Our proposed model identified the 9 risk factors and one interaction term to be significantly contributing to the survival time of patients with MM, given in Table 3. Given any set of values of the significant risk factors that we have identified, we can predict the survival time of a patient with multiple myeloma with at least 94% accuracy. Serum calcium, blood urea nitrogen (BUN)/serum creatinine, and Bence Jone protein in urine (BJPU) were identified to be significantly contributing to the survival time, a finding consistent to that reported by others [22,32]. BJPU was ranked as the highest contributor to the survival time, followed by BUN, and serum calcium was ranked sixth (

Conclusion
We have developed and propose a data-driven statistical model that identifies nine significant risk factors and one interaction term, namely Bence Jone protein in urine, blood urea nitrogen (BUN)/serum creatinine, infections, % myeloid cells in peripheral blood, fractures, serum calcium, gender, platelets and age, and white blood cells & total serum protein that contribute to the survival time of patients diagnosed with multiple myeloma. The proposed model has been evaluated using the statistical model assumptions, coefficient of determination (R 2 along with 2 adjusted R ) statistic, the Akaike information criterion (AIC) of model selection, the prediction error sum of squares (PRESS), the root mean square error (RMSE), the variance inflation factor (VIF), the residual analysis, and the prediction accuracy (the correlation of the actual and predicted survival times based on 80% training set and 20% testing set) to be of (1) Given any set of values of the identified significant risk factors, we can obtain a good estimate/prediction of the survival time of patients diagnosed with MM.
(2) Identifies the individual risk factors and interaction that are significantly contributing to the survival time of MM patients.
(3) We can obtain the ranks of the attributable risk factors based on the percentage of contribution to the survival time of MM patients.
(4) We can perform surface response analysis to assess the contribution by each risk factor as a way to maximize the survival time of multiple myeloma patients.
(5) We can compute confidence limits with a desirable degree of confidence that will be essential in controlling the survival time; for instance, when the survival time of a patient fall below the confidence limit he/she can be said to be in a critical condition, and hence requires immediate attention and treatment. The above statistical findings are with a high degree of accuracy and provide strategies for further improving the therapeutic/ treatment process of the multiple myeloma cancer disease.