Logistic regression is one of the most important tools for data analysis in various fields. The classical approach to parameter estimation is maximum likelihood, but a disadvantage of this method is its high sensitivity to outlying observations. Robust estimators for logistic regression are an alternative owing to their resistance to outliers. This paper presents a new class of robust techniques for logistic regression: weighted maximum likelihood estimators of Mallows type. We compare the performance of these techniques with the classical maximum likelihood estimator and some existing robust estimators. The results are illustrated through a simulation study and real datasets. The new estimators showed the best performance relative to the other estimators.

Logistic regression is an appropriate method to model data and explain the relationship between a binary response variable and explanatory variables. The maximum likelihood estimator (MLE) is the common technique for parameter estimation in the binary regression model. Unfortunately, this method is not resistant to atypical observations in the data. To handle this problem, many robust estimators have been proposed as alternatives to the MLE. [

In this article, we investigate the use of the weight functions introduced by [

The maximum likelihood estimator for the logistic regression model is given in Section 2. In Section 3, we review and describe some of the existing robust techniques. We assess the performance of the estimators based on the results of a simulation study and real data in Section 4. The conclusion is given in Section 5.

Suppose the binary response variable $y_i$ takes the value 1 or 0 for the $i^{th}$ observation, $i = 1, \ldots, n$, with success probability $\pi_i = P(y_i = 1 \mid x_i)$.

We can write the multiple logistic regression model as:

$$\pi_i = \frac{\exp(x_i^{T}\beta)}{1 + \exp(x_i^{T}\beta)},$$

where $x_i = (1, x_{i1}, \ldots, x_{ip})^{T}$ is the vector of explanatory variables for the $i^{th}$ observation and $\beta = (\beta_0, \beta_1, \ldots, \beta_p)^{T}$ is the vector of unknown parameters. Equivalently, $g(\pi_i) = x_i^{T}\beta$,

where $g$ is a link function; for logistic regression, $g$ is the logit function $g(\pi) = \log(\pi / (1 - \pi))$. Other common choices of link function are:

• The probit function: $g(\pi) = \Phi^{-1}(\pi)$, where $\Phi$ is the standard normal distribution function.

• The complementary log-log function: $g(\pi) = \log(-\log(1 - \pi))$.
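
The three link functions above can be evaluated numerically; the snippet below is an illustrative sketch using only the Python standard library (it is not part of the paper).

```python
import math
from statistics import NormalDist

def logit(p):
    # log-odds: maps a probability in (0, 1) onto the real line
    return math.log(p / (1 - p))

def probit(p):
    # inverse standard normal CDF
    return NormalDist().inv_cdf(p)

def cloglog(p):
    # complementary log-log link
    return math.log(-math.log(1 - p))

for p in (0.1, 0.5, 0.9):
    print(f"p={p}: logit={logit(p):+.4f}, "
          f"probit={probit(p):+.4f}, cloglog={cloglog(p):+.4f}")
```

All three links are monotone increasing in $p$; logit and probit are symmetric about $p = 0.5$, while the complementary log-log link is asymmetric.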

In this article, we focus on the logit link function. The classical maximum likelihood estimator is used to estimate the vector of unknown parameters $\beta$. The likelihood contribution of the $i^{th}$ observation is given by:

$$\pi_i^{y_i}(1 - \pi_i)^{1 - y_i},$$

and each observation $y_i$ is an independent Bernoulli random variable, so the likelihood function is:

$$L(\beta) = \prod_{i=1}^{n} \pi_i^{y_i}(1 - \pi_i)^{1 - y_i}.$$

Then, we take the log-likelihood of the above formula:

$$\ell(\beta) = \sum_{i=1}^{n}\left[ y_i \log(\pi_i) + (1 - y_i)\log(1 - \pi_i) \right],$$

where $\pi_i = \exp(x_i^{T}\beta) / (1 + \exp(x_i^{T}\beta))$.
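
As a concrete illustration of the log-likelihood above, the following sketch evaluates it for the logit model; the data are made up for the example.

```python
import math

def log_likelihood(beta, X, y):
    """Bernoulli log-likelihood for logistic regression:
    sum_i [ y_i * log(pi_i) + (1 - y_i) * log(1 - pi_i) ]."""
    ll = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))   # linear predictor x_i' beta
        pi = 1.0 / (1.0 + math.exp(-eta))            # logistic probability
        ll += yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
    return ll

# toy data: intercept column plus one covariate
X = [(1.0, 0.5), (1.0, -1.2), (1.0, 2.0), (1.0, 0.1)]
y = [1, 0, 1, 0]
print(log_likelihood((0.0, 1.0), X, y))
```

At $\beta = 0$ every $\pi_i = 0.5$, so the log-likelihood equals $n \log(0.5)$; maximizing over $\beta$ gives the MLE.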

In designed experiments, we have repeated observations or trials at each level of the explanatory variables $x$. Let $y_i$ denote the number of successes among the $n_i$ trials at the $i^{th}$ level, with success probability $\pi_i$;

However, the likelihood function can be maximized by differentiating it with respect to $\beta$:

where,

where

As a result, the MLE is typically obtained by solving the score equation:

$$\sum_{i=1}^{n}\left( y_i - \pi_i \right) x_i = 0 . \qquad (5)$$

Equation (5) is nonlinear in $\beta$, so it is solved

by an iterative method in which each step involves solving a weighted least squares problem of the form:

$$\beta^{(t+1)} = \left( X^{T} W X \right)^{-1} X^{T} W z,$$

where $X$ is the design matrix and $W = \mathrm{diag}\{\pi_i(1 - \pi_i)\}$ is the weight matrix evaluated at the current estimate $\beta^{(t)}$,

with a new vector $z = X\beta^{(t)} + W^{-1}(y - \pi)$ (the working response) computed at each iteration. This procedure is known as iteratively reweighted least squares (IRLS).
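
The iteration just described can be sketched as follows; this is a minimal NumPy implementation of the standard IRLS algorithm with simulated data, not the authors' code.

```python
import numpy as np

def irls_logistic(X, y, tol=1e-8, max_iter=50):
    """Fit logistic regression by iteratively reweighted least squares.
    Each step solves the weighted least squares problem
    (X'WX) beta = X'W z with working response z."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(max_iter):
        eta = X @ beta
        pi = 1.0 / (1.0 + np.exp(-eta))          # fitted probabilities
        w = pi * (1.0 - pi)                      # IRLS weights
        z = eta + (y - pi) / w                   # working response
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# simulated data from a logistic model with intercept -0.5 and slope 1.0
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_beta = np.array([-0.5, 1.0])
y = (rng.random(200) < 1 / (1 + np.exp(-(X @ true_beta)))).astype(float)
print(irls_logistic(X, y))
```

At convergence the returned estimate satisfies the score equation (5), i.e. $X^{T}(y - \pi) \approx 0$.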

An outlier is an observation that deviates from the other values in the data and produces large residuals. In the logistic regression model, an outlier can occur in the response variable, in the predictor variables, or in both. In the binary regression model, all the response variables

In this article, we propose a new class of robust techniques for logistic regression: weighted maximum likelihood estimators, where the weights depend on the weight functions introduced by [

In the CUBIF estimator, the weights controlling atypical observations depend on both the response variables and the predictor variables; this estimator is also known as a Schweppe-class estimator, introduced by [. Let $y_i$ denote the $i^{th}$ response variable and $x_i$ the $i^{th}$ vector of explanatory variables,

where $B$ is a variance-covariance matrix, $b$ is a bound on the measure of infinitesimal sensitivity, and

The weight function takes the form:

[

where

where

This estimate is called the weighted maximum likelihood estimate (Mallows-type estimator), and the influence function of the WMLE is given by:

where
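
A Mallows-type weighted MLE solves the weighted score equation $\sum_i w(x_i)(y_i - \pi_i)x_i = 0$, where the weight depends only on the design point. The sketch below plugs such weights into the IRLS step; the particular weight function and its tuning constant are illustrative placeholders, not the weights from the paper or the cited works.

```python
import numpy as np

def mallows_wmle(X, y, tol=1e-8, max_iter=100):
    """Weighted MLE (Mallows type): solve sum_i w_i (y_i - pi_i) x_i = 0,
    where w_i depends only on the design point x_i.
    Here w_i = min(1, c / ||x_i||), an ad-hoc leverage-style weight."""
    n, p = X.shape
    c = np.median(np.linalg.norm(X, axis=1))         # illustrative tuning constant
    w_x = np.minimum(1.0, c / np.linalg.norm(X, axis=1))
    beta = np.zeros(p)
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = w_x * pi * (1.0 - pi)                    # combined IRLS weight
        z = X @ beta + (y - pi) / (pi * (1.0 - pi))  # working response
        beta_new = np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (W * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# simulated data with one planted high-leverage design point
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(150), rng.normal(size=150)])
y = (rng.random(150) < 1 / (1 + np.exp(-(X @ np.array([0.3, 1.2]))))).astype(float)
X[0] = [1.0, 8.0]
print(mallows_wmle(X, y))
```

Because the weights are fixed functions of $x_i$, the weighted log-likelihood remains concave, so the modified IRLS iteration still converges; the high-leverage point simply enters the fit with weight $w_1 < 1$.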

Similar to the strategy used in constructing the Mallows-type estimator, we propose a new class of robust techniques: weighted maximum likelihood estimators whose weights depend on the weight functions introduced by [

The weight functions we propose are defined as follows. First weight:

Second weight:

Then, the weighted maximum likelihood estimators for logistic regression can be obtained as the solution of:

For these weights, no observation is trimmed. We used the modified algorithm for the Mallows-type estimator of [

In order to examine the performance of the estimators, two approaches were taken. The first uses simulated models to compare the new techniques with the classical MLE, the Mallows-type estimator of [

In this subsection, a simulation study was carried out to examine the performance of the new robust techniques (WMLEw_{1}, WMLEw_{2}) and compare them with the MLE, the conditionally unbiased bounded-influence (CUBIF) estimator of [

The performance of these estimators is examined based on the bias and mean squared error (MSE) under different scenarios; an estimator with small bias and MSE is preferred. Each scenario was run over 1000 repetitions. The bias and mean squared error for each parameter are computed as follows:

$$\text{Bias}(\hat{\beta}_j) = \frac{1}{1000}\sum_{r=1}^{1000}\left(\hat{\beta}_j^{(r)} - \beta_j\right)$$

and

$$\text{MSE}(\hat{\beta}_j) = \frac{1}{1000}\sum_{r=1}^{1000}\left(\hat{\beta}_j^{(r)} - \beta_j\right)^2.$$
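
The Monte Carlo bias and MSE computations can be sketched as follows; this is a simplified single-parameter illustration with made-up replicate estimates.

```python
import numpy as np

def bias_and_mse(estimates, true_value):
    """Monte Carlo bias and MSE over simulation repetitions:
    bias = mean(beta_hat) - beta,  MSE = mean((beta_hat - beta)^2)."""
    estimates = np.asarray(estimates, dtype=float)
    bias = estimates.mean() - true_value
    mse = np.mean((estimates - true_value) ** 2)
    return bias, mse

# e.g. 1000 replicate estimates of a slope whose true value is 1.0
rng = np.random.default_rng(2)
reps = 1.0 + 0.2 * rng.normal(size=1000)
print(bias_and_mse(reps, 1.0))
```

Note that MSE combines both sources of error: MSE = bias² + variance of the replicate estimates.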

For the clean (uncontaminated) data, the WMLEw_{1} and WMLEw_{2} estimators perform less well than the other estimators. It can be observed that the bias and mean squared errors decrease as the sample size increases. Under the contaminated scenarios, WMLEw_{1} and WMLEw_{2} have overall the best performance among all compared estimators for the different sample sizes.

The results for moderate contamination show that WMLEw_{1} and WMLEw_{2} perform better than the other estimators in terms of bias and mean squared error. The classical maximum likelihood estimates perform poorly in the contaminated model due to their sensitivity to outliers. In summary, the two new estimators show the best performance among all compared techniques on contaminated data, and they also perform reasonably on clean data.

Methods | Bias (n=100) | MSE (n=100) | Bias (n=200) | MSE (n=200) | Bias (n=300) | MSE (n=300) | Bias (n=400) | MSE (n=400)
---|---|---|---|---|---|---|---|---
MLE | 0.1872 | 0.0881 | 0.1269 | 0.0413 | 0.1018 | 0.0264 | 0.0881 | 0.0195
MALLOWS | 0.1839 | 0.0873 | 0.1269 | 0.0410 | 0.0967 | 0.0247 | 0.0861 | 0.0194
CUBIF | 0.1853 | 0.0884 | 0.1290 | 0.0422 | 0.0992 | 0.0254 | 0.0855 | 0.0190
WMLEw_{1} | 0.3442 | 0.2947 | 0.3122 | 0.1302 | 0.0987 | 0.0858 | 0.2278 | 0.0742
WMLEw_{2} | 0.3240 | 0.2991 | 0.3318 | 0.1471 | 0.2018 | 0.1723 | 0.2464 | 0.0961

Methods | Bias (n=100) | MSE (n=100) | Bias (n=200) | MSE (n=200) | Bias (n=300) | MSE (n=300) | Bias (n=400) | MSE (n=400)
---|---|---|---|---|---|---|---|---
MLE | 0.8736 | 1.5627 | 0.8203 | 1.2564 | 0.7805 | 1.1229 | 0.7808 | 1.0993
MALLOWS | 0.8742 | 1.5535 | 0.7988 | 1.2115 | 0.7835 | 1.1340 | 0.7947 | 1.1442
CUBIF | 0.8799 | 1.5888 | 0.7811 | 1.1654 | 0.7754 | 1.1215 | 0.7728 | 1.0486
WMLEw_{1} | 0.3322 | 0.2800 | 0.0691 | 0.1492 | 0.5345 | 0.4209 | 0.4261 | 0.2946
WMLEw_{2} | 0.3288 | 0.2424 | 0.0165 | 0.1578 | 0.4666 | 0.3274 | 0.4301 | 0.3053

Methods | Bias (n=100) | MSE (n=100) | Bias (n=200) | MSE (n=200) | Bias (n=300) | MSE (n=300) | Bias (n=400) | MSE (n=400)
---|---|---|---|---|---|---|---|---
MLE | 0.9062 | 1.6799 | 0.8301 | 1.3019 | 0.7858 | 1.1394 | 0.7921 | 1.1371
MALLOWS | 0.9077 | 1.7056 | 0.7981 | 1.2197 | 0.8041 | 1.1953 | 0.7969 | 1.1502
CUBIF | 0.9033 | 1.6877 | 0.8080 | 1.2462 | 0.7867 | 1.1581 | 0.7841 | 1.1219
WMLEw_{1} | 0.4358 | 0.3190 | 0.0604 | 0.1368 | 0.4961 | 0.3526 | 0.4606 | 0.3089
WMLEw_{2} | 0.4024 | 0.2718 | 0.0047 | 0.1350 | 0.4410 | 0.2812 | 0.4373 | 0.2805

Methods | Bias (n=100) | MSE (n=100) | Bias (n=200) | MSE (n=200) | Bias (n=300) | MSE (n=300) | Bias (n=400) | MSE (n=400)
---|---|---|---|---|---|---|---|---
MLE | 0.9997 | 2.1176 | 0.8496 | 1.4131 | 0.8194 | 1.2636 | 0.8145 | 1.2167
MALLOWS | 0.9746 | 2.0663 | 0.8571 | 1.4281 | 0.8328 | 1.3034 | 1.8308 | 1.2621
CUBIF | 1.0167 | 2.1856 | 0.8710 | 1.8890 | 0.8349 | 1.3201 | 0.8134 | 1.2106
WMLEw_{1} | 0.2511 | 0.1654 | 0.0352 | 0.1548 | 0.2393 | 0.1224 | 0.2494 | 0.1150
WMLEw_{2} | 0.2350 | 0.1556 | 0.0729 | 0.1510 | 0.1897 | 0.0925 | 0.2404 | 0.1013

The dataset analyzed here includes 33 leukemia patients. Three variables were measured for each patient: Time, AG and WBC. The response variable is the patient's survival time in weeks, which we coded as a binary variable (1 = the patient survived more than 52 weeks, 0 = otherwise). The two explanatory variables are WBC, the patient's white blood cell count, and AG, a binary variable (1 = presence of a morphologic characteristic of white blood cells, 0 = absence). According to an identification method for atypical observations in the leukemia data, observation number 17 appears atypical. A logistic regression model was fitted using the binary survival time y as the response variable and AG and WBC as the predictor variables. The estimators examined here are the new weighted maximum likelihood estimates (WMLEw_{1}, WMLEw_{2}), MLE, MLE_{17} (the maximum likelihood estimator for the clean data after excluding observation number 17), Mallows (Mallows-type estimator) and CUBIF (conditionally unbiased bounded-influence function estimator).

It can be observed that the new estimates (WMLEw_{1}, WMLEw_{2}) show the best performance among all the estimators for the leukemia data. However, the Mallows estimates are sensibly close to the MLE_{17}.

In this study, we introduced two new robust techniques for logistic regression, namely weighted maximum likelihood estimators. In order to examine the performance of the new techniques, we conducted simulation experiments under different scenarios and analyzed real datasets. The classical maximum likelihood estimates lack robustness when outliers are present. Our simulation experiments for uncontaminated models demonstrated that the MLE, Mallows and CUBIF estimators perform fairly close to each other, while the new weighted techniques perform less well than the other estimators. In both the simulation study under different contaminated scenarios and the real datasets, the newly proposed weighted maximum likelihood techniques showed the best performance among all compared estimators. The approach used here to construct robust estimators can also be extended to other generalized linear models such as the Poisson regression model and the negative binomial model.

Methods | Est. (Intercept) | S.E. (Intercept) | Est. (WBC) | S.E. (WBC) | Est. (AG) | S.E. (AG)
---|---|---|---|---|---|---
MLE | −1.3073 | 0.8145 | 0.3717 | 0.0186 | 2.2617 | 0.9522
MLE_{17} | 0.2119 | 1.0830 | −0.0002 | 0.0001 | 2.5580 | 1.2341
MALLOWS | 0.1602 | 1.0697 | −0.2245 | 0.0129 | 2.5252 | 1.2159
CUBIF | −1.4503 | 1.8504 | −0.0527 | 0.0337 | 0.2202 | 0.9756
WMLEw_{1} | −0.0011 | 0.3992 | 0.0012 | 0.0070 | 1.4744 | 0.4769
WMLEw_{2} | −1.5486 | 0.4588 | −0.0064 | 0.0066 | 1.3786 | 0.5481

The authors declare no conflicts of interest regarding the publication of this paper.

Ahmed, I.A.I. and Cheng, W.H. (2020) The Performance of Robust Methods in Logistic Regression Model. Open Journal of Statistics, 10, 127-138. https://doi.org/10.4236/ojs.2020.101010