🕷️ Crawler Inspector

Raw Queries and Responses

1. Shard Calculation

Query:
Response:
Calculated Shard: 159 (from laksa086)

2. Crawled Status Check

Query:
Response:

3. Robots.txt Check

Query:
Response:

4. Spam/Ban Check

Query:
Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE
✅ CRAWLED (1 day ago)
🤖 ROBOTS ALLOWED

Page Info Filters

| Filter | Status | Condition | Details |
| --- | --- | --- | --- |
| HTTP status | PASS | download_http_code = 200 | HTTP 200 |
| Age cutoff | PASS | download_stamp > now() - 6 MONTH | 0.1 months ago |
| History drop | PASS | isNull(history_drop_reason) | No drop reason |
| Spam/ban | PASS | fh_dont_index != 1 AND ml_spam_score = 0 | ml_spam_score=0 |
| Canonical | PASS | meta_canonical IS NULL OR = '' OR = src_unparsed | Not set |

Page Details

| Property | Value |
| --- | --- |
| URL | https://timeseriesreasoning.com/contents/zero-inflated-poisson-regression-model/ |
| Last Crawled | 2026-04-07 05:28:42 (1 day ago) |
| First Indexed | 2021-10-14 23:46:39 (4 years ago) |
| HTTP Status Code | 200 |
| Meta Title | The Zero Inflated Poisson Regression Model - Statistical Modeling and Forecasting |
| Meta Description | The Zero Inflated Poisson Regression Model can be used to model counts based data sets which contain an excess of zero valued data points.. |
| Meta Canonical | null |
Boilerpipe Text
Introduction to the ZIP model and a Python tutorial on training a ZIP model on a dataset having excess zeroes

In this section, we’ll learn how to build a regression model for counts-based data sets in which the dependent variable contains an excess of zero-valued data.

Counts data sets are ones where the dependent variable is the count of an event, such as:
- Number of vehicles crossing an intersection per hour
- Number of ER visits happening each month
- Number of motor vehicle insurance claims filed per year
- Number of defects found in a mass-produced printed circuit board

Data set containing many zero counts (Image by Author)

Many real-world phenomena produce counts that are almost always zero. For example:
- Number of times a machine fails each month
- Number of exoplanets discovered each year
- Number of billionaires living in each city in the world

Such data are hard to deal with using traditional models for counts data such as the Poisson, the Binomial or the Negative Binomial regression models. This is because such data sets contain more zero-valued counts than one would expect to observe under the traditional model’s probability distribution.

For example, if you assume that a phenomenon obeys the following Poisson(5) process, you would expect to see zero counts only about 0.67% of the time:

A Poisson(5) process will generate zeros in about 0.67% of observations (Image by Author)

If you observe zero counts far more often than that, the data set contains an excess of zeroes.

If you use a standard Poisson, Binomial or NB regression model on such data sets, it can fit badly and generate poor-quality predictions, no matter how much you tweak its parameters.

So what is a modeler to do when faced with such data with excess zeros?

Fortunately, there is a way to modify a standard counts model such as Poisson or Negative Binomial to account for the presence of the extra zeroes. In fact, there happen to be at least two ways to do this.
One technique is known as the Hurdle model and the second technique is known as the Zero-Inflated model.

In this section, we’ll look at the zero-inflated regression model in some detail. Specifically, we’ll focus on the Zero Inflated Poisson regression model, often referred to as the ZIP model.

The structure of a ZIP model

Let’s briefly look at the structure of a regular Poisson model before we see how its structure is modified to handle excess zero counts.

Imagine a data set containing n samples and p regression variables per sample. Therefore, the regression variables X can be represented by a matrix of size (n x p), and each row x_i in the X matrix is a vector of size (1 x p) corresponding to the dependent variable value y_i:

A data set (y, X) in matrix notation (Image by Author)

If we assume that y is a Poisson distributed random variable, we can build a Poisson regression model for this data set. The Poisson model is made up of two parts:
- A Poisson Probability Mass Function (PMF), denoted as P(y_i=k), used to calculate the probability of observing k events in any unit interval given a mean event rate of λ events per unit time.
- A link function that is used to express the mean rate λ as a function of the regression variables X.

This is illustrated in the figure below:

Probability Mass Function of the standard Poisson regression model (Image by Author)

Normally, we assume that there is some underlying process that is producing the observed counts as per the Poisson PMF: P(y_i=k).

The intuition behind the Zero Inflated Poisson model is that there is a second underlying process that is determining whether a count is zero or non-zero. Once a count is determined to be non-zero, the regular Poisson process takes over to determine its actual non-zero value based on the Poisson process’s PMF.

Thus, a ZIP regression model consists of three parts:
- A PMF P(y_i=0) which is used to calculate the probability of observing a zero count.
- A second PMF P(y_i=k) which is used to calculate the probability of observing k events, given that k > 0.
- A link function that is used to express the mean rate λ as a function of the regression variables X.

This is illustrated in the following figure:

Probability Mass Function of the ZIP model (Image by Author)

As before, y_i is the random variable that denotes the observed count corresponding to the regression variables row x_i = [x_i1, x_i2, x_i3, …, x_ip].

ϕ_i is a measure of the proportion of excess zeroes corresponding to the ith row (y_i, x_i) in the data set.

Getting to know ϕ_i

A simple way to understand ϕ_i is as follows:

Imagine that you take 1000 observations of y_i, each one with the same combination of regression variable values x_i = [x_i1, x_i2, x_i3, …, x_ip]. Since y_i is a random variable that follows the Poisson distribution, you may see a different value of y_i in each one of the 1000 observations.

Suppose that out of the 1000 y_i values you observe, 874 are zero. You determine that out of these 874 zero values, the regular Poisson distribution that you have assumed for y_i can explain only up to 7. So the remaining 867 zero values are excess zero observations. So for the ith row in your data set, ϕ_i = 867/1000 = 0.867.

When the data set does not have any excess zeroes in the dependent variable, the value of ϕ works out to be zero and the PMF of the ZIP model reduces to the PMF of the standard Poisson model (you can easily verify this by setting ϕ to 0 in the ZIP model’s PMF).

How to estimate ϕ?

So how can we estimate the value of ϕ_i?

A simple and crude way of estimating ϕ_i is by setting each ϕ_i to the following ratio:

A simple but inaccurate way to estimate ϕ_i in the ZIP model (Image by Author)

Perhaps a more realistic way of calculating ϕ_i is by estimating it as a function of regression variables X.
This is usually done by transforming the y variable to a binary 0/1 random variable y’ (y_prime) which takes the value 0 if the underlying y is 0, and 1 in all other cases. We then train a Logistic regression model on the data set [X, y’], and it yields a vector of fitted probabilities µ_fitted = [µ_1, µ_2, µ_3, …, µ_n] (because that’s what a Logistic regression model does).

Once we get the µ_fitted vector, we simply set the vector ϕ to it. Thus [ϕ_1=µ_1, ϕ_2=µ_2, ϕ_3=µ_3, …, ϕ_n=µ_n].

The above process of estimating ϕ is illustrated below:

The training sequence for estimating excess zeros parameter ϕ in a ZIP model (Image by Author)

Once the ϕ vector is estimated, we plug it into the probability functions of the ZIP model and use what is known as the Maximum Likelihood Estimation (MLE) technique to train the ZIP model on the data set with excess zero counts.

The following figure illustrates the training sequence of the ZIP model:

Training sequence of the ZIP model (Image by Author)

Thankfully, there are many statistics packages that automate this entire procedure of estimating ϕ and using the estimated ϕ to train the ZIP model using the MLE technique on your data set.

In the rest of this section, we’ll use the Python statsmodels library to build and train a ZIP model in a single line of code.

How to train the ZIP model using Python

In our Python tutorial on the ZIP model, we’ll use a data set of camping trips taken by 250 groups of people:

The camping trips data set (Image by Author)

The data set is available here: https://gist.github.com/sachinsdate/09cfd42b7701c48ec68b04c786786434

Here are a couple of salient features of this data set:
- The campers may or may not have done some fishing during their trip.
- If a group did some fishing, they would have caught zero or more fish.
- We want to estimate not only how many fish were caught (if there was fishing done by a camping group), but also the probability that the camping group caught any fish at all.

Thus, there are two distinct data generation processes involved:
- A process that determines whether or not a camping group indulged in a successful fishing activity: The ZIP model will internally use the Logistic Regression model explained earlier to model this binary process.
- A second process that determines how many fish were caught by a camping group, given that there was at least one fish caught by the group: The ZIP model will use a regular Poisson model for modeling this second process.

Variables in the data set

The camping trips data set contains the following variables:
- FISH_COUNT: The number of fish that were caught. This will be our dependent variable y.
- LIVE_BAIT: A binary variable indicating whether live bait was used.
- CAMPER: Whether the fishing group used a camper van.
- PERSONS: Total number of people in the fishing group. Note that in some groups, none of them may have fished.
- CHILDREN: The number of children in the camping group.

Here is a frequency distribution of the dependent FISH_COUNT variable:

Frequency distribution of FISH_COUNT (Image by Author)

As we can see, there may be excess zeroes in this data set. We’ll train a ZIP model on this data set to test this theory and hopefully achieve a better fit than the regular Poisson model.

Regression Goal

Our regression goal on this data set is to predict the number of fish caught (FISH_COUNT) by a camping group based on the values of the LIVE_BAIT, CAMPER, PERSONS and CHILDREN variables.

Regression Strategy

Our regression strategy will be as follows:
- FISH_COUNT will be the dependent variable y, and [LIVE_BAIT, CAMPER, PERSONS and CHILDREN] will be the explanatory variables X.
- We’ll use the Python statsmodels library to train the ZIP regression model on the (y, X) data set.
- We’ll make some predictions using the ZIP model on a test data set that the model has not seen during its training.

Let’s begin by importing all the required packages:

    import pandas as pd
    from patsy import dmatrices
    import numpy as np
    import statsmodels.api as sm
    import matplotlib.pyplot as plt

Next, we’ll load the fish data set into memory. The data set is available at https://gist.github.com/sachinsdate/09cfd42b7701c48ec68b04c786786434:

    df = pd.read_csv('fish.csv', header=0)

Let’s print the top few rows of the data set:

    print(df.head(10))

Top 10 rows of the data set (Image by Author)

Let’s also print out the frequency distribution of FISH_COUNT values:

    df.groupby('FISH_COUNT').count()

Frequency distribution of fish counts (Image by Author)

Create the training and test data sets. Note that for now, we are not doing a stratified random split:

    mask = np.random.rand(len(df)) < 0.8
    df_train = df[mask]
    df_test = df[~mask]
    print('Training data set length='+str(len(df_train)))
    print('Testing data set length='+str(len(df_test)))

    >> Training data set length=196
    >> Testing data set length=54

Set up the regression expression in Patsy notation. We are telling Patsy that FISH_COUNT is our dependent variable y and it depends on the regression variables LIVE_BAIT, CAMPER, PERSONS and CHILDREN:

    expr = 'FISH_COUNT ~ LIVE_BAIT + CAMPER + CHILDREN + PERSONS'

Let’s use Patsy to carve out the X and y matrices for the training and testing data sets:

    y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
    y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')

Using statsmodels’s ZeroInflatedPoisson class, let’s build and train a ZIP regression model on the training data set.

But before we do so, let me explain how to use two parameters that the class constructor takes:
- inflation: The ZeroInflatedPoisson model class will internally use a LogisticRegression model to estimate the parameter ϕ. Hence we set the model parameter inflation to ‘logit’.
We can also experiment with setting it to other Binomial link functions such as ‘probit’.
- exog_infl: We also want to ask the ZIP model to estimate ϕ as a function of the same set of regression variables as the parent model, namely: LIVE_BAIT, CAMPER, PERSONS and CHILDREN. Hence we set the parameter exog_infl to X_train. If you want to use only a subset of X_train, you can do so, or you can set exog_infl to an entirely different set of regression variables.

The statement below builds and trains the ZIP model on our training data set in a single line of code:

    zip_training_results = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, inflation='logit').fit()

Print the training summary:

    print(zip_training_results.summary())

Here is the training summary (I have highlighted the important elements in the output):

Training summary of the ZIP model (Image by Author)

Interpreting the training output

The blue box contains information about variables that the nested Logistic Regression model has used to estimate the probability ϕ of whether or not any fish were caught by a camping group.

Regression coefficients, their standard errors and z-scores from the fitted ZIP model (Image by Author)

Notice that the Logistic regression model did not find the Intercept, LIVE_BAIT and CAMPER variables useful for estimating ϕ. Their regression coefficients were found to be NOT statistically significant at the 95% confidence level, as indicated by the respective p values: inflate_Intercept=0.425, inflate_LIVE_BAIT=0.680 and inflate_CAMPER=0.240, which are all greater than 0.05 (i.e. the 5% error threshold).

Observation 1

The only two variables that the Logistic Regression model determined as useful for estimating the probability of whether or not any fish were caught were CHILDREN and PERSONS.
Observation 2

The regression coefficient of PERSONS is negative (inflate_PERSONS -1.2193), which means that as the number of people in the camping group increases, the probability of no fish being caught by that group decreases. This is in line with our intuition.

The red box contains information about variables that the parent Poisson model used to estimate FISH_COUNT on the condition that FISH_COUNT > 0.

Observation 3

We can see that all 5 coefficients (the intercept and the four regression variables) are statistically significant at a 99% confidence level, as evidenced by their p values, which are less than 0.01. In fact, the p value is less than 0.001 for all 5, hence it is showing up as 0.000.

Observation 4

The coefficient for CHILDREN is negative (CHILDREN -1.0810), meaning that as the number of children in the camping group goes up, the number of fish caught by that group goes down!

Observation 5

The Maximized Log-Likelihood of this model is -566.43. This value is useful for comparing the goodness-of-fit of the model with that of other models.

Observation 6

Finally, note that the training algorithm of the ZIP model was not able to converge on the training data set, as indicated in the training output. If it had converged, perhaps it would have resulted in a better fit. We could try to fix that by passing a maxiter=100 parameter into the fit() method.

Prediction

We’ll get the ZIP model’s predictions on the test data set and calculate the root mean square error w.r.t.
the actual values:

    zip_predictions = zip_training_results.predict(X_test, exog_infl=X_test)
    predicted_counts = np.round(zip_predictions)
    # y_test is the single-column data frame built by Patsy; its column is FISH_COUNT
    actual_counts = y_test['FISH_COUNT']
    print('ZIP RMSE='+str(np.sqrt(np.sum(np.power(np.subtract(predicted_counts,actual_counts),2)))))

    >> ZIP RMSE= 55.65069631190611

Note that the expression above takes the square root of the summed squared errors rather than of their mean. Dividing the sum by the number of test rows (54 in this run) before taking the square root gives a root mean square error of about 7.57.

Let’s plot the predicted versus actual fish counts:

    fig = plt.figure()
    fig.suptitle('Predicted versus actual counts using the ZIP model')
    predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted')
    actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual')
    plt.legend(handles=[predicted, actual])
    plt.show()

We see the following plot:

Predicted versus actual fish caught (Image by Author)

This completes our look at the Zero-Inflated Poisson regression model.

Citations and Copyrights

Papers

Lambert, Diane. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” Technometrics, vol. 34, no. 1, 1992, pp. 1–14. JSTOR, www.jstor.org/stable/1269547.

Books

Cameron A. C. and Trivedi P. K., Regression Analysis of Count Data, Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.

Images

All images are copyright Sachin Date under CC-BY-NC-SA, unless a different source and copyright are mentioned underneath the image.
Markdown
![The Zero Inflated Poisson Regression Model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2024/05/HERO_1280px-Deschutes_Wild_and_Scenic_River_28260152076.jpg?resize=830%2C374&ssl=1)

# The Zero Inflated Poisson Regression Model

###### Introduction to the ZIP model and a Python tutorial on training a **ZIP model** on a dataset having excess zeroes

***

In this section, we’ll learn how to build a regression model for **counts-based data sets** in which the dependent variable contains **an excess of zero-valued data**.

**Counts data sets** are ones where the dependent variable is the count of an event, such as:

- Number of vehicles crossing an intersection per hour
- Number of ER visits happening each month
- Number of motor vehicle insurance claims filed per year
- Number of defects found in a mass-produced printed circuit board

![Data set containing many zero counts](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/e524c-14sguqptvctofho-1_gqf1g.png?w=768&ssl=1)

Data set containing many zero counts (Image by [Author](https://www.linkedin.com/in/sachindate/))

Many real-world phenomena produce counts that are almost always zero. For example:

- Number of times a machine fails each month
- Number of exoplanets discovered each year
- Number of billionaires living in each city in the world
Such data are hard to deal with using traditional models for counts data such as the [**Poisson**](https://timeseriesreasoning.com/contents/poisson-regression-model/), the [**Binomial**](https://towardsdatascience.com/the-binomial-regression-model-everything-you-need-to-know-5216f1a483d3) or the [**Negative Binomial**](https://timeseriesreasoning.com/contents/negative-binomial-regression-model/) regression models. This is because such data sets **contain more zero-valued counts than one would expect to observe under the traditional model’s probability distribution**.

For example, if you assume that a phenomenon obeys the following *Poisson(5)* process, you would expect to see zero counts only about 0.67% of the time:

![A Poisson(5) process will generate zeros in about 0.67% of observations](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/11386-18vpaxiswoxqaupc_jc8zoq.png?w=768)

A Poisson(5) process will generate zeros in about 0.67% of observations (Image by [Author](https://www.linkedin.com/in/sachindate/))

If you observe zero counts far more often than that, the data set contains **an excess of zeroes.**

If you use a standard Poisson, Binomial or NB regression model on such data sets, it can fit badly and generate poor-quality predictions, no matter how much you tweak its parameters.

So what is a modeler to do when faced with such data with excess zeros?

***

## The Zero Inflated Poisson Regression model

Fortunately, there is a way to modify a standard counts model such as Poisson or Negative Binomial to account for the presence of the extra zeroes. In fact, there happen to be at least two ways to do this. One technique is known as the **Hurdle model** and the second technique is known as the **Zero-Inflated model**.

In this section, we’ll look at the zero-inflated regression model in some detail. Specifically, we’ll focus on the **Zero Inflated Poisson regression model**, often referred to as the **ZIP model**.
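As a quick check of the 0.67% figure quoted earlier for a Poisson(5) process, the probability of a zero count under a Poisson distribution with mean λ is e^(−λ). The snippet below is a minimal sketch using only the Python standard library; the helper name `poisson_pmf` is illustrative and not part of the original tutorial.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Probability of observing exactly k events under Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Probability of a zero count when the mean event rate is 5
p_zero = poisson_pmf(0, 5.0)
print(f"P(y = 0 | lambda = 5) = {p_zero:.4%}")  # about 0.67%
```

If the share of zeros in your data is far above this value for a plausible λ, that is a first hint that the data may contain excess zeros.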
### The structure of a ZIP model

Let’s briefly look at the structure of a regular Poisson model before we see how its structure is modified to handle excess zero counts.

Imagine a data set containing *n* samples and *p* regression variables per sample. Therefore, the regression variables ***X*** can be represented by a matrix of size *(n x p)*, and each row ***x\_i*** in the ***X*** matrix is a vector of size *(1 x p)* corresponding to the dependent variable value *y\_i*:

![A data set (y, X) in matrix notation](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/e1404-1id8hghqgszeay_3r-zphga.png?w=768)

A data set (***y, X***) in matrix notation (Image by [Author](https://www.linkedin.com/in/sachindate/))

If we assume that ***y*** is a Poisson distributed random variable, we can build a Poisson regression model for this data set. The Poisson model is made up of two parts:

1. A Poisson **P**robability **M**ass **F**unction (PMF), denoted as *P(y\_i=k)*, used to calculate the probability of observing *k* events in any unit interval given a mean event rate of λ events per unit time.
2. A link function that is used to express the mean rate λ as a function of the regression variables ***X***.

This is illustrated in the figure below:

![Probability Mass Function of the standard Poisson regression model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/180c1-1ci0iqsqnzgzfrmil9akjza.png?w=768)

Probability Mass Function of the standard Poisson regression model (Image by [Author](https://www.linkedin.com/in/sachindate/))

Normally, we assume that there is some underlying process that is producing the observed counts as per the Poisson PMF: *P(y\_i=k)*.

The intuition behind the Zero Inflated Poisson model is that ***there is a second underlying process that is determining whether a count is zero or non-zero***.
Once a count is determined to be non-zero, the regular Poisson process takes over to determine its actual non-zero value based on the Poisson process’s PMF.

Thus, a **ZIP** regression model consists of three parts:

1. A PMF *P(y\_i=0)* which is used to calculate the probability of observing a zero count.
2. A second PMF *P(y\_i=k)* which is used to calculate the probability of observing *k* events, *given that k \> 0*.
3. A link function that is used to express the mean rate λ as a function of the regression variables ***X***.

This is illustrated in the following figure:

![Probability Mass Function of the ZIP model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/f782f-1q9dtw-96auhefd920nv3-w.png?w=768)

Probability Mass Function of the ZIP model (Image by [Author](https://www.linkedin.com/in/sachindate/))

As before, *y\_i* is the random variable that denotes the observed count corresponding to the regression variables row ***x\_i*** *= \[x\_i1, x\_i2, x\_i3, …, x\_ip\]*.

*ϕ\_i* is a measure of the proportion of excess zeroes corresponding to the ith row *(y\_i, **x\_i**)* in the data set.

### Getting to know ϕ\_i

A simple way to understand *ϕ\_i* is as follows:

Imagine that you take 1000 observations of *y\_i*, each one with the ***same*** combination of regression variable values ***x\_i*** *= \[x\_i1, x\_i2, x\_i3, …, x\_ip\]*. Since *y\_i* is a *random variable* that follows the Poisson distribution, you may see a different value of *y\_i* in each one of the 1000 observations.

Suppose that out of the 1000 *y\_i* values you observe, 874 are zero. You determine that out of these 874 zero values, the regular Poisson distribution that you have assumed for *y\_i* can explain only up to 7. So the remaining 867 zero values are excess zero observations. So for the *ith* row in your data set, *ϕ\_i* = 867/1000 = 0.867.
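The worked example above can be reproduced numerically. The sketch below assumes the ZIP figure shows the standard zero-inflated Poisson formulation (as in Lambert, 1992): P(y=0) = ϕ + (1−ϕ)·e^(−λ) and P(y=k) = (1−ϕ)·e^(−λ)·λ^k / k! for k > 0. The function names and the back-solved value of λ are illustrative assumptions, not taken from the original text.

```python
import math

def poisson_pmf(k: int, lam: float) -> float:
    """Standard Poisson PMF."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def zip_pmf(k: int, lam: float, phi: float) -> float:
    """Assumed standard ZIP PMF: a point mass at zero mixed with Poisson(lam)."""
    if k == 0:
        return phi + (1.0 - phi) * math.exp(-lam)
    return (1.0 - phi) * poisson_pmf(k, lam)

# From the example: 867 of 1000 observations are excess zeros
phi = 867 / 1000

# The Poisson part explains 7 of 1000 zeros, i.e. (1 - phi) * exp(-lam) = 0.007.
# Solving for lam is only done here to make the example self-consistent.
lam = -math.log(0.007 / (1.0 - phi))

p_zero = zip_pmf(0, lam, phi)                           # total share of zeros
total = sum(zip_pmf(k, lam, phi) for k in range(60))    # should be ~1

print(f"lambda = {lam:.4f}")
print(f"P(y = 0) = {p_zero:.3f}")
print(f"sum of PMF = {total:.6f}")

# With phi = 0 the ZIP PMF collapses to the plain Poisson PMF
same = all(abs(zip_pmf(k, lam, 0.0) - poisson_pmf(k, lam)) < 1e-12 for k in range(10))
print("phi = 0 reduces to Poisson:", same)
```

Under these assumptions the total probability of a zero comes out to 0.874, matching the 874 zeros out of 1000 in the example: 867 from the excess-zero process plus 7 from the Poisson process.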
When the data set does not have any excess zeroes in the dependent variable, the value of ***ϕ*** works out to be zero and the PMF of the ZIP model reduces to the PMF of the standard Poisson model (you can easily verify this by setting ***ϕ*** to 0 in the ZIP model’s PMF).

### How to estimate *ϕ*?

So how can we estimate the value of *ϕ\_i*?

A simple and crude way of estimating *ϕ\_i* is by setting each *ϕ\_i* to the following ratio:

![A simple but inaccurate way to estimate ϕ\_i in the ZIP model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/68260-1brs2o3bt5bzz1hjs9sctcq.png?w=768)

A simple but inaccurate way to estimate *ϕ\_i* in the ZIP model (Image by [Author](https://www.linkedin.com/in/sachindate/))

Perhaps a more realistic way of calculating *ϕ\_i* is by estimating it as a function of regression variables ***X***. This is usually done by transforming the ***y*** variable to a binary 0/1 random variable ***y’*** (***y\_prime***) which takes the value 0 if the underlying ***y*** is 0, and 1 in all other cases. We then train a **Logistic regression model** on the data set \[***X, y’***\], and it yields a vector of fitted probabilities ***µ\_fitted*** *= \[µ\_1, µ\_2, µ\_3, …, µ\_n\]* (because that’s what a Logistic regression model does).

Once we get the ***µ\_fitted*** vector, we simply set the vector ***ϕ*** to it. Thus *\[ϕ\_1=µ\_1, ϕ\_2=µ\_2, ϕ\_3=µ\_3, …, ϕ\_n=µ\_n\]*.
The above process of estimating ***ϕ*** is illustrated below:

![The training sequence for estimating excess zeros parameter ϕ in a ZIP model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/f2e9b-1elxe7t8crnog3x7teohnlq.png?w=768)

The training sequence for estimating *excess zeros parameter* ***ϕ*** in a ZIP model (Image by [Author](https://www.linkedin.com/in/sachindate/))

Once the ***ϕ*** vector is estimated, we plug it into the probability functions of the ZIP model and use what is known as the **M**aximum **L**ikelihood **E**stimation (**MLE**) technique to train the ZIP model on the data set with excess zero counts.

The following figure illustrates the training sequence of the **ZIP** model:

![Training sequence of the ZIP model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/2c45f-1pynehn4r7fmmc5bf_yi4cw.png?w=768)

Training sequence of the ZIP model (Image by [Author](https://www.linkedin.com/in/sachindate/))

Thankfully, there are many statistics packages that automate this entire procedure of estimating ***ϕ*** and using the estimated ***ϕ*** to train the ZIP model using the MLE technique on your data set.

In the rest of this section, we’ll use the Python [statsmodels](https://www.statsmodels.org/stable/index.html) library to build and train a ZIP model in a single line of code.

***

## How to train the ZIP model using Python

In our Python tutorial on the ZIP model, we’ll use a data set of camping trips taken by 250 groups of people:

![The camping trips data set](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/1db1e-1u6wyuqdxda2kumuun5pntg.png?w=768)

The camping trips data set (Image by [Author](https://www.linkedin.com/in/sachindate/))

The [data set is available here](https://gist.github.com/sachinsdate/09cfd42b7701c48ec68b04c786786434).

Here are a couple of salient features of this data set:

- The campers may or may not have done some fishing during their trip.
- If a group did some fishing, they would have caught zero or more fish.
- We want to estimate not only how many fish were caught (if there was fishing done by a camping group), but also the probability that the camping group caught any fish at all.

Thus, there are two distinct data generation processes involved:

1. A process that determines whether or not a camping group indulged in a successful fishing activity: The ZIP model will internally use the Logistic Regression model explained earlier to model this binary process.
2. A second process that determines how many fish were caught by a camping group, given that there was at least one fish caught by the group: The ZIP model will use a regular Poisson model for modeling this second process.

**Variables in the data set**

The camping trips data set contains the following variables:

**FISH\_COUNT:** The number of fish that were caught. This will be our dependent variable ***y***.

**LIVE\_BAIT:** A binary variable indicating whether live bait was used.

**CAMPER:** Whether the fishing group used a camper van.

**PERSONS:** Total number of people in the fishing group. Note that in some groups, none of them may have fished.

**CHILDREN:** The number of children in the camping group.

Here is a frequency distribution of the dependent FISH\_COUNT variable:

![Frequency distribution of FISH\_COUNT](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/74fe7-1kdio-becuqax7nzkkrowug.png?w=768)

Frequency distribution of FISH\_COUNT (Image by [Author](https://www.linkedin.com/in/sachindate/))

As we can see, there *may* be excess zeroes in this data set. We’ll train a ZIP model on this data set to test this theory and hopefully achieve a better fit than the regular Poisson model.

### Regression Goal

Our regression goal on this data set is to predict the number of fish caught (FISH\_COUNT) by a camping group based on the values of the LIVE\_BAIT, CAMPER, PERSONS and CHILDREN variables.
### Regression Strategy

Our regression strategy will be as follows:

1. FISH\_COUNT will be the dependent variable ***y***, and \[LIVE\_BAIT, CAMPER, PERSONS and CHILDREN\] will be the explanatory variables ***X***.
2. We’ll use the Python ***statsmodels library*** to train the ZIP regression model on the (***y, X***) data set.
3. We’ll make some predictions using the ZIP model on a test data set that the model has not seen during its training.

Let’s begin by importing all the required packages:

```
import pandas as pd
from patsy import dmatrices
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
```

Next, we’ll load the fish data set into memory. Here is the [link to the data set](https://gist.github.com/sachinsdate/09cfd42b7701c48ec68b04c786786434):

```
df = pd.read_csv('fish.csv', header=0)
```

Let’s print the top few rows of the data set:

```
print(df.head(10))
```

![Top 10 rows of the data set](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/510be-1kc8fohge7guekbfolhyekg.png?w=768)

Top 10 rows of the data set (Image by [Author](https://www.linkedin.com/in/sachindate/))

Let’s also print out the frequency distribution of FISH\_COUNT values:

```
df.groupby('FISH_COUNT').count()
```

![Frequency distribution of fish counts](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/559f8-1lph-kmazbzfjhljz6yp0iq.png?w=768)

Frequency distribution of fish counts (Image by [Author](https://www.linkedin.com/in/sachindate/))

Create the training and test data sets. Note that for now, we are not doing a stratified random split:

```
# Assign roughly 80% of the rows to training and the rest to testing
mask = np.random.rand(len(df)) < 0.8
df_train = df[mask]
df_test = df[~mask]
print('Training data set length='+str(len(df_train)))
print('Testing data set length='+str(len(df_test)))
```

```
>> Training data set length=196
>> Testing data set length=54
```

Set up the regression expression in [**Patsy**](https://patsy.readthedocs.io/en/latest/quickstart.html) notation.
We are telling Patsy that FISH\_COUNT is our dependent variable ***y*** and it depends on the regression variables LIVE\_BAIT, CAMPER, PERSONS and CHILDREN:

```
expr = 'FISH_COUNT ~ LIVE_BAIT + CAMPER + CHILDREN + PERSONS'
```

Let’s use Patsy to carve out the ***X*** and ***y*** matrices for the training and testing data sets.

```
y_train, X_train = dmatrices(expr, df_train, return_type='dataframe')
y_test, X_test = dmatrices(expr, df_test, return_type='dataframe')
```

Using the statsmodels [**ZeroInflatedPoisson**](https://www.statsmodels.org/stable/generated/statsmodels.discrete.count_model.ZeroInflatedPoisson.html) class, let’s build and train a ZIP regression model on the training data set.

But before we do so, let me explain how to use two parameters that the class constructor takes:

- ***inflation:*** The *ZeroInflatedPoisson* model class will internally use a Logistic regression model to estimate the parameter ***ϕ***. Hence we set the model parameter *inflation* to 'logit'. We can also experiment with setting it to the other supported Binomial link function, 'probit'.
- ***exog\_infl:*** We also want to ask the ZIP model to estimate *ϕ* as a function of the same set of regression variables as the parent model, namely: LIVE\_BAIT, CAMPER, PERSONS and CHILDREN. Hence we set the parameter *exog\_infl* to X\_train. If you want to use only a subset of X\_train, you can do so, or you can set *exog\_infl* to an entirely different set of regression variables.

The statement below builds and trains the ZIP model on our training data set in a single line of code.
```
zip_training_results = sm.ZeroInflatedPoisson(endog=y_train, exog=X_train, exog_infl=X_train, inflation='logit').fit()
```

Print the training summary:

```
print(zip_training_results.summary())
```

Here is the training summary (I have highlighted the important elements in the output):

![Training summary of the ZIP model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/0486e-14lpdkhflukypa6hcppti3w.png?w=768)

Training summary of the ZIP model (Image by [Author](https://www.linkedin.com/in/sachindate/))

### Interpreting the training output

The blue box contains information about the variables that the nested Logistic Regression model has used to estimate the probability ***ϕ*** of whether or not any fish were caught by a camping group.

![Regression coefficients, their standard errors and z-scores from the fitted ZIP model](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/ac652-1v1qbcox-g6nctf2gycki0g.png?w=768)

Regression coefficients, their standard errors and z-scores from the fitted ZIP model (Image by [Author](https://www.linkedin.com/in/sachindate/))

Notice that the Logistic regression model did not find the Intercept, LIVE\_BAIT and CAMPER variables useful for estimating ***ϕ***. Their regression coefficients were found to be NOT statistically significant at the 95% confidence level, as indicated by the respective *p* values: inflate\_Intercept=**0.425**, inflate\_LIVE\_BAIT=**0.680** and inflate\_CAMPER=**0.240**, which are all greater than 0.05 (i.e. the 5% error threshold).

#### Observation 1

The only two variables that the Logistic Regression model determined as useful for estimating the probability of whether or not any fish were caught were CHILDREN and PERSONS.

#### Observation 2

The regression coefficient of PERSONS is negative (inflate\_PERSONS **\-1.2193**), which means that as the number of people in the camping group increases, the probability of no fish being caught by that group decreases.
This is in line with our intuition.

The red box contains information about the variables that the parent Poisson model used to estimate FISH\_COUNT on the condition that FISH\_COUNT \> 0.

![](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/0834a-1fgfarynmflggc81yyyoklg.png?w=768)

(Image by [Author](https://www.linkedin.com/in/sachindate/))

#### Observation 3

We can see that all 5 coefficients (the intercept and the four regression variables) are statistically significant at a 99% confidence level, as evidenced by their p values, which are less than 0.01. In fact, the p value is less than 0.001 for all 5 coefficients, hence it shows up as 0.000.

#### Observation 4

The coefficient for CHILDREN is negative (CHILDREN -1.0810), meaning that as the number of children in the camping group goes up, the number of fish caught by that group goes down\!

#### Observation 5

The Maximized Log-Likelihood of this model is -566.43. This value is useful for comparing the goodness-of-fit of the model with that of other models.

#### Observation 6

Finally, note that the training algorithm of the ZIP model was not able to converge on the training data set, as indicated by the following:

![](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/e1017-1uggteo4baudrczcpke81ia.png?w=768)

(Image by [Author](https://www.linkedin.com/in/sachindate/))

If it had converged, perhaps it would have resulted in a better fit. We could try to fix that by raising the iteration limit, for example by passing *maxiter*\=100 to the fit() method.

***

### Prediction

We’ll get the ZIP model’s predictions on the test data set and calculate the root mean square error (RMSE) w.r.t.
the actual values:

```
zip_predictions = zip_training_results.predict(X_test, exog_infl=X_test)
predicted_counts = np.round(zip_predictions)
# y_test is a one-column DataFrame; pick the dependent variable by name
actual_counts = y_test['FISH_COUNT']
# RMSE: square root of the mean of the squared prediction errors
rmse = np.sqrt(np.mean(np.power(np.subtract(predicted_counts, actual_counts), 2)))
print('ZIP RMSE=' + str(rmse))
```

The run recorded for this tutorial summed the squared errors instead of averaging them, and printed:

```
>> ZIP RMSE=55.65069631190611
```

That figure is therefore the square root of the total squared error. Assuming the 54-row test set shown earlier, it corresponds to an RMSE of about 55.65/√54 ≈ 7.57.

Let’s plot the predicted versus actual fish counts:

```
fig = plt.figure()
fig.suptitle('Predicted versus actual counts using the ZIP model')
predicted, = plt.plot(X_test.index, predicted_counts, 'go-', label='Predicted')
actual, = plt.plot(X_test.index, actual_counts, 'ro-', label='Actual')
plt.legend(handles=[predicted, actual])
plt.show()
```

We see the following plot:

![Predicted versus actual fish caught](https://i0.wp.com/timeseriesreasoning.com/wp-content/uploads/2021/06/17f41-1pdq2dob7nh1yecxalfk4za.png?w=768)

Predicted versus actual fish caught (Image by [Author](https://www.linkedin.com/in/sachindate/))

This completes our look at the Zero-Inflated Poisson regression model.

***

## Citations and Copyrights

### Papers

Lambert, Diane. “Zero-Inflated Poisson Regression, with an Application to Defects in Manufacturing.” *Technometrics*, vol. 34, no. 1, 1992, pp. 1–14. JSTOR, [www.jstor.org/stable/1269547](http://www.jstor.org/stable/1269547).

### Books

Cameron A. C. and Trivedi P. K., [Regression Analysis of Count Data](http://faculty.econ.ucdavis.edu/faculty/cameron/racd2/), Second Edition, Econometric Society Monograph No. 53, Cambridge University Press, Cambridge, May 2013.

### Images

All images are copyright [Sachin Date](https://www.linkedin.com/in/sachindate/) under [CC-BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/), unless a different source and copyright are mentioned underneath the image.