vignettes/Goodness_of_fit_Tests.Rmd
Goodness_of_fit_Tests.Rmd
Let denote the time until the occurrence of an event of interest, whose distribution function we denote by . The GofCens package offers various goodness-of-fit techniques to assess whether a univariate sample from , either complete or right-censored, i.e., the observed times are smaller than the actual times of interest, comes from a specified (continuous) distribution , where represents a vector of unknown parameters. Formally, the null hypothesis goodness-of-fit test is , representing a vector of unknown parameters. Specifically, the GofCens package provides implementations of well-known tests such as the Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling tests based on the empirical distribution function for complete data and their extensions for right-censored data. Additionally, GofCens includes a chi-squared-type test based on the squared differences between observed and expected counts using random cells, with an extension tailored for right-censored data.
We will illustrate how the functions of the package works using the following simulated survival times. We generate survival times from a log-normal distribution with location parameter and scale parameter , i.e. , and censoring times from an exponential distribution with scale parameter , i.e. :
The observed right-censored survival times, , and the corresponding event indicators, , are created as follows:
times <- pmin(survt, censt)
delta <- as.numeric(survt <= censt)
In total, () of the survival times of the generated sample are right-censored:
table(delta)
#> delta
#> 0 1
#> 106 194
The Kolmogorov-Smirnov test adapted to right-censored data is
implemented in the function KScens()
of the
GofCens package. This function provides the observed
Kolmogorov-Smirnov statistic adapted to right-censored data and the
p-value that can be computed either via the theoretical approximation or
via bootstrap methods.
We run the KScens()
function to assess the goodness of
fit of the log-normal and the Weibull distributions. The p-value
obtained with the log-normal distribution is, as expected, quite large,
whereas the low p-value in the second example provides large evidence
against the Weibull distribution.
KScens(times, delta, distr = "lognormal")
#> Distribution: lognormal
#>
#> KS Test results:
#> A p-value
#> 0.781 0.116
#>
#> Parameter estimates (se):
#> location scale
#> 1.988 (0.058) 0.883 (0.045)
#>
#> AIC: 1266.899
#> BIC: 1274.307
Note that in this second example, with apply the internal function
print.KScens()
, which allows us to change the default list
output to a table format (outp = "table"
).
set.seed(123)
print(KScens(times, delta, distr = "weibull"), outp = "table")
#> Distribution: weibull
#>
#> KS Test results:
#> ------- | -------
#> Metric | Value
#> ------- | -------
#> A | 1.391
#> p-value | 0.001
#> ------- | -------
#>
#> Parameter estimates:
#> --------- | --------- | ---------
#> Parameter | Value | s.e.
#> --------- | --------- | ---------
#> shape | 1.273 | 0.067
#> scale | 10.98 | 0.621
#> --------- | --------- | ---------
#>
#> AIC: 1304.046
#> BIC: 1311.453
By default the KScens()
function computes the p-value
via bootstrap methods, nonetheless if the argument
boot=FALSE
is used, the computation of the p-value is done
using the theoretical approximation. Let us see an example:
KScens(times, delta, distr = "lognormal", boot = FALSE)
#> Distribution: lognormal
#>
#> KS Test results:
#> A p-value
#> 0.781 0.576
#>
#> Parameter estimates (se):
#> location scale
#> 1.988 (0.058) 0.883 (0.045)
#>
#> AIC: 1266.899
#> BIC: 1274.307
The function CvMcens()
of the GofCens
package provides both the observed Cramér-von Mises statistic adapted to
right-censored data and the p-value obtained via bootstrap methods.
We run the CvMcens()
function to assess the goodness of
fit of the log-normal and the Weibull distributions, as before. Hence,
the null hypotheses in the illustrations remain the same as before, as
do the conclusions: we would conclude that the data may come from a
log-normal distribution, but not from a Weibull distribution.
set.seed(123)
CvMcens(times, delta, distr = "lognormal")
#> Distribution: lognormal
#>
#> CvM Test results:
#> CvM p-value
#> 0.054 0.450
#>
#> Parameter estimates (se):
#> location scale
#> 1.988 (0.058) 0.883 (0.045)
#>
#> AIC: 1266.899
#> BIC: 1274.307
And for the Weibull distribution:
The function ADcens()
of the GofCens
package provides both the observed Anderson-Darling statistic adapted to
right-censored data and the p-value obtained via bootstrap methods.
For example, we can test the null hypothesis that the data come from a log-normal distribution with location and scale parameters equal to and , respectively. In this case, as shown below, the Anderson-Darling test clearly rejects the null hypothesis. Note that the output now displays both the parameters of the null hypothesis and the estimated parameters
set.seed(123)
print(ADcens(times, delta, distr = "lognormal",
params0 = list(location = 2, scale = 1.5)), outp = "table")
#> Distribution: lognormal
#>
#> Null hypothesis:
#> --------- | ---------
#> Parameter | Value
#> --------- | ---------
#> location | 2
#> scale | 1.5
#> --------- | ---------
#>
#> AD Test results:
#> ------- | -------
#> Metric | Value
#> ------- | -------
#> AD | 9.536
#> p-value | 0.001
#> ------- | -------
#>
#> Parameter estimates:
#> --------- | --------- | ---------
#> Parameter | Value | s.e.
#> --------- | --------- | ---------
#> location | 1.988 | 0.058
#> scale | 0.883 | 0.045
#> --------- | --------- | ---------
#>
#> AIC: 1266.899
#> BIC: 1274.307
There is a function implemented computing the three p-values for the
three previous tests simultaneously by means of bootstrapping methods.
The function gofcens()
computes the test statistics of the
Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling tests adapted
to right-censored data and returns the corresponding p values computed
via bootstrap methods.
Let us expose a couple of examples with the gofcens()
function:
set.seed(123)
gofcens(times, delta, distr = "lognormal")
#> Distribution: lognormal
#>
#> Test statistics
#> KS CvM AD
#> 0.781 0.054 0.500
#>
#> p-values
#> KS CvM AD
#> 0.115 0.450 0.520
#>
#> Parameter estimates (se):
#> location scale
#> 1.988 (0.058) 0.883 (0.045)
#>
#> AIC: 1266.899
#> BIC: 1274.307
Now we use the function to assess whether the Weibull distribution fits the data, and we print the results in the table format.
set.seed(123)
print(gofcens(times, delta, distr = "weibull"), outp = "table")
#> Distribution: weibull
#>
#> Tests results:
#> ---------------- | ---------------- | ----------------
#> Test | Statistics value | p-value
#> ---------------- | ---------------- | ----------------
#> KS | 1.391 | 0.001
#> CvM | 0.376 | 0.001
#> AD | 2.828 | 0.001
#> ---------------- | ---------------- | ----------------
#>
#> Parameter estimates:
#> --------- | --------- | ---------
#> Parameter | Value | s.e.
#> --------- | --------- | ---------
#> shape | 1.273 | 0.067
#> scale | 10.98 | 0.621
#> --------- | --------- | ---------
#>
#> AIC: 1304.046
#> BIC: 1311.453
Chi-squared type tests are also implemented, the function
chisqcens()
of the GofCens package uses
bootstrap techniques to compute the p-value.
In this function two random cell numbers are provided: the number chosen by the user (Original) and the final number (Final), which might be smaller than the previous one because of right-censored data. For example, in the following illustrations with the data of the right-censored sample from the log-normal distribution, we choose random cells, but as shown in the output, the number is reduced to .
set.seed(123)
chisqcens(times, delta, M = 8, distr = "lognormal")
#> Distribution: lognormal
#>
#> Chi-squared Test results:
#> Statistic p-value
#> 8.657 0.307
#>
#> Parameter estimates (se):
#> location scale
#> 1.988 (0.058) 0.883 (0.045)
#>
#> Cell numbers:
#> Original Final
#> 8 7
#>
#> AIC: 1266.899
#> BIC: 1274.307
Now we use the function for the Weibull distribution and we print it in the table format.
set.seed(123)
print(chisqcens(times, delta, M = 8, distr = "weibull"), outp = "table")
#> Distribution: weibull
#>
#> Chi-squared Test results:
#> --------- | ---------
#> Metric | Value
#> --------- | ---------
#> Statistic | 24.297
#> p-value | 0.021
#> --------- | ---------
#>
#> Parameter estimates:
#> --------- | --------- | ---------
#> Parameter | Value | s.e.
#> --------- | --------- | ---------
#> shape | 1.273 | 0.067
#> scale | 10.98 | 0.621
#> --------- | --------- | ---------
#>
#> Cell numbers:
#> Original | Final
#> --------- | ---------
#> 8 | 7
#> --------- | ---------
#>
#> AIC: 1304.046
#> BIC: 1311.453
Again, based on the outputs, we would choose the log-normal distribution instead of the Weibull distribution, because its value of the test statistic is clearly smaller and the p-value is far larger.
In this section, we apply the above functions of the GofCens package to determine which parametric model fits best to the survival times of former NBA players.
The data frame nba comes with the GofCens
package and contains the survival times (variable survtime
)
of all
former players of the of the National Basketball Association (NBA) until
July 2019. Survival times are measured as the elapsed time (in years)
from the end of the NBA career until either death
(cens == 1
) or July 31, 2019 (cens == 1
). By
this date,
()
of the former players had died with uncensored post-NBA survival times
ranging from a few days until nearly 70 years.
We apply the gofcens()
function to the logistic and
normal distributions in order to see the results of the
Kolmogorov-Smirnov, Cramér-von Mises, and Anderson-Darling tests. To
reduce the computation times, which would otherwise be very long, we
first select a random sample of survival times of size
.
data("nba")
set.seed(123)
nbasamp <- nba[sample(nrow(nba), 500), ]
set.seed(123)
gofcens(Surv(survtime, cens) ~ 1, nbasamp, distr = "logistic")
#> Distribution: logistic
#>
#> Test statistics
#> KS CvM AD
#> 0.911 0.229 2.450
#>
#> p-values
#> KS CvM AD
#> 0.017 0.123 0.056
#>
#> Parameter estimates (se):
#> location scale
#> 51.882 (1.204) 8.608 (0.579)
#>
#> AIC: 1104.212
#> BIC: 1112.641
Now for the normal distribution:
set.seed(123)
gofcens(Surv(survtime, cens) ~ 1, nbasamp, distr = "normal")
#> Distribution: normal
#>
#> Test statistics
#> KS CvM AD
#> 1.207 0.432 3.787
#>
#> p-values
#> KS CvM AD
#> 0.001 0.046 0.027
#>
#> Parameter estimates (se):
#> location scale
#> 51.704 (1.359) 16.187 (0.969)
#>
#> AIC: 1111.097
#> BIC: 1119.526
The test statistics of all three tests are smaller in the case of the logistic distribution and the corresponding p-values are larger. Thus, we would select the logistic distribution over the normal distribution.