🕷️ Crawler Inspector

Raw Queries and Responses

1. Shard Calculation

Query:

Response:

Calculated Shard: 166 (from laksa177)

2. Crawled Status Check

Query:

curl -X POST \
  'http://laksa166.int.ahrefs:8124/' \
  -H 'Content-Type: text/plain' \
  -H 'X-ClickHouse-Database: crawler3' \
  -H 'Authorization: Basic YXBpOg==' \
  -d "SELECT getAhrefsURLFromUnparsed(src_unparsed) AS found_url, ifNull(toUnixTimestamp(download_stamp), 0) AS crawl_time, ifNull(toUnixTimestamp(props_url_first_seen), 0) AS first_indexed_time, download_http_code AS http_code, src_unparsed AS src_unparsed, src_root_hash AS src_root_hash, history_drop_reason AS history_drop_reason, meta_title AS meta_title, meta_descriptions AS meta_descriptions, attrs_boilerpipe_text AS attrs_boilerpipe_text, attrs_markdown AS attrs_markdown, attrs_readable_markdown AS attrs_readable_markdown, meta_canonical AS meta_canonical, ml_categories_json AS ml_categories_json, ml_types_json AS ml_types_json, ml_intent_types_json AS ml_intent_types_json, meta_language AS meta_language, attrs_author AS attrs_author, ifNull(toUnixTimestamp(attrs_publish_time), 0) AS attrs_publish_time, ifNull(toUnixTimestamp(attrs_original_publish_time), 0) AS attrs_original_publish_time, ifNull(attrs_is_republished, 0) AS attrs_is_republished, ifNull(attrs_nr_words, 0) AS attrs_nr_words, ifNull(attrs_boilerpipe_nr_words, 0) AS attrs_boilerpipe_nr_words, ifNull(body_ext_links_number, 0) AS body_ext_links_number, ifNull(body_int_links_number, 0) AS body_int_links_number, ifNull(meta_nofollow, 0) AS meta_nofollow, ifNull(meta_noarchive, 0) AS meta_noarchive, ifNull(props_was_rendered, 0) AS props_was_rendered, ifNull(src_redirect, '') AS src_redirect, ifNull(download_time_msec, 0) AS download_time_msec, ifNull(download_ttfb_msec, 0) AS download_ttfb_msec, ifNull(download_size, 0) AS download_size FROM crawler3.page_info_local FINAL PREWHERE (src_root_hash, src_unparsed) IN ((getAhrefsRootHashFromUnparsed(getAhrefsUnparsedNoserviceFromURL('https://aosmith.rbind.io/2019/03/06/lots-of-zeros/')), getAhrefsUnparsedNoserviceFromURL('https://aosmith.rbind.io/2019/03/06/lots-of-zeros/'))) FORMAT JSONEachRow"

Response:

{"found_url":"https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/","crawl_time":1777339117,"first_indexed_time":1551904677,"http_code":200,"src_unparsed":"io,rbind!aosmith,\/2019\/03\/06\/lots-of-zeros\/ s443","src_root_hash":"13779606198685285766","history_drop_reason":null,"meta_title":"Lots of zeros or too many zeros?: Thinking about zero inflation in count data","meta_descriptions":["When working with counts, having many zeros does not necessarily indicate zero inflation. I demonstrate this by simulating data from the negative binomial and generalized Poisson distributions. I then show one way to check if the data has excess zeros compared to the number of zeros expected based on the model."],"attrs_boilerpipe_text":"In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation.\nZero inflation is when there are more 0 values in the data than the distribution allows for. But some distributions can have a lot of zeros!\nTable of Contents\nLoad packages and dataset\nNegative binomial with many zeros\nGeneralized Poisson with many zeros\nLots of zeros or excess zeros?\nSimulate negative binomial data\nChecking for excess zeros\nAn example with excess zeros\nJust the code, please\nLoad packages and dataset\nI’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today.\nPackage\nHMMpa\nis for a function to draw random samples from the generalized Poisson distribution.\nlibrary(ggplot2) # v. 3.1.0\nlibrary(HMMpa) # v. 1.0.1\nlibrary(MASS) # v. 
7.3-51.1\nNegative binomial with many zeros\nFirst I’ll draw 200 counts from a negative binomial with a mean (\n\$\\lambda\$\n) of\n\$10\$\nand\n\$\\theta = 0.05\$\n.\nR uses the parameterization of the negative binomial where the variance of the distribution is\n\$\\lambda + (\\lambda^2\/\\theta)\$\n. In this parameterization, as\n\$\\theta\$\ngets small the variance gets big. Using a very small value of theta like I am will generally mean the distribution of counts will have many zeros as well as a few large counts\nI pull a random sample of size 200 from this distribution using\nrnbinom()\n. The\nmu\nargument is the mean and the\nsize\nargument is theta.\nset.seed(16)\ndat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )\nBelow is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution.\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Negative binomial\",\n         subtitle = \"mean = 10, theta = 0.05\" ) +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 150, y = 100, size = 8)\nGeneralized Poisson with many zeros\nI don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷\nFrom my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. 
This would mean that it can have more extreme maximum counts as well as lots of zeros.\nSee the documentation for\nrgenpois()\nfor the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when\nlambda2\nis 0, the generalized Poisson reduces to the Poisson.\nset.seed(16)\ndat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )\nBelow is a histogram of these data. Just over 50% of the values are zeros but the maximum count is over 1000! 💥\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Generalized Poisson\",\n         subtitle = \"lambda1 = 0.5, lambda2 = 0.95\") +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 600, y = 100, size = 8)\nLots of zeros or excess zeros?\nAll the simulations above show us is that some distributions\ncan\nhave a lot of zeros. In any given scenario, though, how do we check if we have\nexcess\nzeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros than we may either need a different distribution to model the data or we could think about models that specifically address zero inflation.\nThe key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using.\nSimulate negative binomial data\nI’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. 
I’ll use a model fit to these data to show how to check for excess zeros.\nSince this is a generalized linear model, I first calculate the means based on the linear predictor. The exponentiation is due to using the natural log link to\nlink\nthe mean to the linear predictor.\nset.seed(16)\nx = runif(200, 5, 10) # simulate explanatory variable\nb0 = 1 # set value of intercept\nb1 = 0.25 # set value of slope\nmeans = exp(b0 + b1*x) # calculate true means\ntheta = 0.25 # true theta\nI can use these true means along with my chosen value of\ntheta\nto simulate data from the negative binomial distribution.\ny = rnbinom(200, mu = means, size = theta)\nNow that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with\nx\nas the explanatory variable, which is how I created the data, this model should work well. The\nglm.nb()\nfunction is from package\nMASS\n.\nfit1 = glm.nb(y ~ x)\nIn this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data.\nChecking for excess zeros\nThe observed data has 76 zeros (out of 200).\nsum(y == 0)\n# [1] 76\nHow many zeros is expected given the model? I need the model estimated means and theta to answer this question. I can get the means via\npredict()\nand I can pull\ntheta\nout of the model\nsummary()\n.\npreds = predict(fit1, type = \"response\") # estimated means\nesttheta = summary(fit1)$theta # estimated theta\nFor discrete distributions like the negative binomial, the\ndensity\ndistribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use\ndnbinom()\nto calculate the probability of an observation being 0 for every row in the dataset. 
To do this I need to provide values for the parameters of the distribution of each observation.\nBased on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta.\nprop0 = dnbinom(x = 0, mu = preds, size = esttheta )\nThe sum of these probabilities is an estimate of the number of zero values expected by the model (see\nhere\nfor another example). I’ll round this to the nearest integer.\nround( sum(prop0) )\n# [1] 72\nThe expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data.\nAn example with excess zeros\nThe example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data.\nfit2 = glm(y ~ x, family = poisson)\nRemember the data contain 76 zeros.\nsum(y == 0)\n# [1] 76\nUsing\ndpois()\n, the number of zeros given be the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data.\nround( sum( dpois(x = 0,\n           lambda = predict(fit2, type = \"response\") ) ) )\n# [1] 0\nThis brings me back to my earlier point about checking model fit. If I had done other standard checks of model fit for\nfit2\nI would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion).\nJust the code, please\nHere’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code\nfrom here\n.\nlibrary(ggplot2) # v. 3.1.0\nlibrary(HMMpa) # v. 1.0.1\nlibrary(MASS) # v. 
7.3-51.1\n\nset.seed(16)\ndat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )\n\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Negative binomial\",\n         subtitle = \"mean = 10, theta = 0.05\" ) +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 150, y = 100, size = 8)\n\nset.seed(16)\ndat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )\n\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Generalized Poisson\",\n         subtitle = \"lambda1 = 0.5, lambda2 = 0.95\") +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 600, y = 100, size = 8)\n\nset.seed(16)\nx = runif(200, 5, 10) # simulate explanatory variable\nb0 = 1 # set value of intercept\nb1 = 0.25 # set value of slope\nmeans = exp(b0 + b1*x) # calculate true means\ntheta = 0.25 # true theta\ny = rnbinom(200, mu = means, size = theta)\n\nfit1 = glm.nb(y ~ x)\n\nsum(y == 0)\n\npreds = predict(fit1, type = \"response\") # estimated means\nesttheta = summary(fit1)$theta # estimated theta\n\nprop0 = dnbinom(x = 0, mu = preds, size = esttheta )\nround( sum(prop0) )\n\nfit2 = glm(y ~ x, family = poisson)\nsum(y == 0)\n\nround( sum( dpois(x = 0,\n           lambda = predict(fit2, type = \"response\") ) ) )","attrs_markdown":"[Very statisticious](https:\/\/aosmith.rbind.io\/)\n\n[![](https:\/\/aosmith.rbind.io\/img\/main\/ao_ppg2.jpg)](https:\/\/aosmith.rbind.io\/#about)\n\n#### Ariel Muldoon\n##### I currently work as an applied statistician in aviation and aeronautics. 
In a previous role as a consulting statistician in academia I created and taught R workshops for applied science graduate students who are just getting started in R, where my goal was to make their transition to a programming language as smooth as possible. See these workshop materials at [my website](http:\/\/ariel.rbind.io\/).\n\n- [Home](https:\/\/aosmith.rbind.io\/)\n- [Tags](https:\/\/aosmith.rbind.io\/tags)\n- [About](https:\/\/aosmith.rbind.io\/#about)\n- [Resume](https:\/\/ariel.rbind.io\/files\/acm_resume.html)\n\n- [Email](mailto:ariel.muldoon@gmail.com)\n- [Twitter](https:\/\/twitter.com\/aosmith16)\n- [GitHub](https:\/\/github.com\/aosmith16)\n- [Stack Overflow](https:\/\/stackoverflow.com\/users\/2461552)\n- [RSS](https:\/\/aosmith.rbind.io\/index.xml)\n- [R Weekly](https:\/\/rweekly.org\/)\n- [R-bloggers](https:\/\/www.r-bloggers.com\/)\n\n# Lots of zeros or too many zeros?: Thinking about zero inflation in count data\nMarch 6, 2019\n\n·  [@aosmith16](https:\/\/twitter.com\/aosmith16)\\&nbsp ·  [View source](https:\/\/github.com\/aosmith16\/aosmith\/blob\/master\/content\/post\/2019-03-06-lots-of-zeros.Rmd)\\&nbsp\n\n[glmm](https:\/\/aosmith.rbind.io\/tags\/glmm), [simulation](https:\/\/aosmith.rbind.io\/tags\/simulation), [teaching](https:\/\/aosmith.rbind.io\/tags\/teaching)\n\nIn a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation.\n\nZero inflation is when there are more 0 values in the data than the distribution allows for. 
But some distributions can have a lot of zeros\\!\n\n## Table of Contents\n- [Load packages and dataset](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#load-packages-and-dataset)\n- [Negative binomial with many zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#negative-binomial-with-many-zeros)\n- [Generalized Poisson with many zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#generalized-poisson-with-many-zeros)\n- [Lots of zeros or excess zeros?](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#lots-of-zeros-or-excess-zeros)\n- [Simulate negative binomial data](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#simulate-negative-binomial-data)\n- [Checking for excess zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#checking-for-excess-zeros)\n- [An example with excess zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#an-example-with-excess-zeros)\n- [Just the code, please](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#just-the-code-please)\n\n# Load packages and dataset\nI’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today.\n\nPackage **HMMpa** is for a function to draw random samples from the generalized Poisson distribution.\n```\nlibrary(ggplot2) # v. 3.1.0\nlibrary(HMMpa) # v. 1.0.1\nlibrary(MASS) # v. 7.3-51.1\n```\n\n# Negative binomial with many zeros\nFirst I’ll draw 200 counts from a negative binomial with a mean (\\\$\\\\lambda\\\$) of \\\$10\\\$ and \\\$\\\\theta = 0.05\\\$.  \n R uses the parameterization of the negative binomial where the variance of the distribution is \\\$\\\\lambda + (\\\\lambda^2\/\\\\theta)\\\$. In this parameterization, as \\\$\\\\theta\\\$ gets small the variance gets big. 
Using a very small value of theta like I am will generally mean the distribution of counts will have many zeros as well as a few large counts\n\nI pull a random sample of size 200 from this distribution using `rnbinom()`. The `mu` argument is the mean and the `size` argument is theta.\n```\nset.seed(16)\ndat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )\n```\nBelow is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution.\n```\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Negative binomial\",\n         subtitle = \"mean = 10, theta = 0.05\" ) +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 150, y = 100, size = 8)\n```\n![](https:\/\/aosmith.rbind.io\/post\/2019-03-06-lots-of-zeros_files\/figure-html\/unnamed-chunk-3-1.png)\n\n# Generalized Poisson with many zeros\nI don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷\n\nFrom my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. This would mean that it can have more extreme maximum counts as well as lots of zeros.\n\nSee the documentation for `rgenpois()` for the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when `lambda2` is 0, the generalized Poisson reduces to the Poisson.\n```\nset.seed(16)\ndat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )\n```\nBelow is a histogram of these data. 
Just over 50% of the values are zeros but the maximum count is over 1000! 💥\n```\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Generalized Poisson\",\n         subtitle = \"lambda1 = 0.5, lambda2 = 0.95\") +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 600, y = 100, size = 8)\n```\n![](https:\/\/aosmith.rbind.io\/post\/2019-03-06-lots-of-zeros_files\/figure-html\/unnamed-chunk-5-1.png)\n\n# Lots of zeros or excess zeros?\nAll the simulations above show us is that some distributions *can* have a lot of zeros. In any given scenario, though, how do we check if we have *excess* zeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros than we may either need a different distribution to model the data or we could think about models that specifically address zero inflation.\n\nThe key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using.\n\n# Simulate negative binomial data\nI’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. I’ll use a model fit to these data to show how to check for excess zeros.\n\nSince this is a generalized linear model, I first calculate the means based on the linear predictor. 
The exponentiation is due to using the natural log link to *link* the mean to the linear predictor.\n```\nset.seed(16)\nx = runif(200, 5, 10) # simulate explanatory variable\nb0 = 1 # set value of intercept\nb1 = 0.25 # set value of slope\nmeans = exp(b0 + b1*x) # calculate true means\ntheta = 0.25 # true theta\n```\nI can use these true means along with my chosen value of `theta` to simulate data from the negative binomial distribution.\n```\ny = rnbinom(200, mu = means, size = theta)\n```\nNow that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with `x` as the explanatory variable, which is how I created the data, this model should work well. The `glm.nb()` function is from package **MASS**.\n```\nfit1 = glm.nb(y ~ x)\n```\nIn this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data.\n\n# Checking for excess zeros\nThe observed data has 76 zeros (out of 200).\n```\nsum(y == 0)\n```\n```\n# [1] 76\n```\nHow many zeros is expected given the model? I need the model estimated means and theta to answer this question. I can get the means via `predict()` and I can pull `theta` out of the model `summary()`.\n```\npreds = predict(fit1, type = \"response\") # estimated means\nesttheta = summary(fit1)$theta # estimated theta\n```\nFor discrete distributions like the negative binomial, the *density* distribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use `dnbinom()` to calculate the probability of an observation being 0 for every row in the dataset. 
To do this I need to provide values for the parameters of the distribution of each observation.\n\nBased on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta.\n```\nprop0 = dnbinom(x = 0, mu = preds, size = esttheta )\n```\nThe sum of these probabilities is an estimate of the number of zero values expected by the model (see [here](https:\/\/data.library.virginia.edu\/getting-started-with-hurdle-models\/) for another example). I’ll round this to the nearest integer.\n```\nround( sum(prop0) )\n```\n```\n# [1] 72\n```\nThe expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data.\n\n# An example with excess zeros\nThe example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data.\n```\nfit2 = glm(y ~ x, family = poisson)\n```\nRemember the data contain 76 zeros.\n```\nsum(y == 0)\n```\n```\n# [1] 76\n```\nUsing `dpois()`, the number of zeros given be the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data.\n```\nround( sum( dpois(x = 0,\n           lambda = predict(fit2, type = \"response\") ) ) )\n```\n```\n# [1] 0\n```\nThis brings me back to my earlier point about checking model fit. If I had done other standard checks of model fit for `fit2` I would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion).\n\n# Just the code, please\nHere’s the code without all the discussion. 
Copy and paste the code below or you can download an R script of uncommented code [from here](https:\/\/aosmith.rbind.io\/script\/2019-03-06-lots-of-zeros.R).\n```\nlibrary(ggplot2) # v. 3.1.0\nlibrary(HMMpa) # v. 1.0.1\nlibrary(MASS) # v. 7.3-51.1\n\nset.seed(16)\ndat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )\n\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Negative binomial\",\n         subtitle = \"mean = 10, theta = 0.05\" ) +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 150, y = 100, size = 8)\n\nset.seed(16)\ndat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )\n\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Generalized Poisson\",\n         subtitle = \"lambda1 = 0.5, lambda2 = 0.95\") +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 600, y = 100, size = 8)\n\nset.seed(16)\nx = runif(200, 5, 10) # simulate explanatory variable\nb0 = 1 # set value of intercept\nb1 = 0.25 # set value of slope\nmeans = exp(b0 + b1*x) # calculate true means\ntheta = 0.25 # true theta\ny = rnbinom(200, mu = means, size = theta)\n\nfit1 = glm.nb(y ~ x)\n\nsum(y == 0)\n\npreds = predict(fit1, type = \"response\") # estimated means\nesttheta = summary(fit1)$theta # estimated theta\n\nprop0 = dnbinom(x = 0, mu = preds, size = esttheta )\nround( sum(prop0) )\n\nfit2 = glm(y ~ x, family = poisson)\nsum(y == 0)\n\nround( sum( dpois(x = 0,\n           lambda = predict(fit2, type = \"response\") ) ) )\n```\n\n[glmm](https:\/\/aosmith.rbind.io\/tags\/glmm\/) 
[simulation](https:\/\/aosmith.rbind.io\/tags\/simulation\/) [teaching](https:\/\/aosmith.rbind.io\/tags\/teaching\/)\n\nPlease enable JavaScript to view the [comments powered by Disqus.](https:\/\/disqus.com\/?ref_noscript)\n\n© 2024 Ariel Muldoon [CC BY-SA 4.0](https:\/\/creativecommons.org\/licenses\/by-sa\/4.0\/)\n\n- [%!(EXTRA string=Facebook)](https:\/\/www.facebook.com\/sharer\/sharer.php?u=https%3A%2F%2Faosmith.rbind.io%2F2019%2F03%2F06%2Flots-of-zeros%2F)\n- [%!(EXTRA string=Twitter)](https:\/\/twitter.com\/intent\/tweet?text=https%3A%2F%2Faosmith.rbind.io%2F2019%2F03%2F06%2Flots-of-zeros%2F)\n\n![](https:\/\/aosmith.rbind.io\/img\/main\/ao_ppg2.jpg)\n\n#### Ariel Muldoon\nI currently work as an applied statistician in aviation and aeronautics. In a previous role as a consulting statistician in academia I created and taught R workshops for applied science graduate students who are just getting started in R, where my goal was to make their transition to a programming language as smooth as possible. See these workshop materials at [my website](http:\/\/ariel.rbind.io\/).","attrs_readable_markdown":"In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation.\n\nZero inflation is when there are more 0 values in the data than the distribution allows for. 
But some distributions can have a lot of zeros\\!\n\n## Table of Contents\n- [Load packages and dataset](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#load-packages-and-dataset)\n- [Negative binomial with many zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#negative-binomial-with-many-zeros)\n- [Generalized Poisson with many zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#generalized-poisson-with-many-zeros)\n- [Lots of zeros or excess zeros?](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#lots-of-zeros-or-excess-zeros)\n- [Simulate negative binomial data](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#simulate-negative-binomial-data)\n- [Checking for excess zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#checking-for-excess-zeros)\n- [An example with excess zeros](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#an-example-with-excess-zeros)\n- [Just the code, please](https:\/\/aosmith.rbind.io\/2019\/03\/06\/lots-of-zeros\/#just-the-code-please)\n\n## Load packages and dataset\nI’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today.\n\nPackage **HMMpa** is for a function to draw random samples from the generalized Poisson distribution.\n```\nlibrary(ggplot2) # v. 3.1.0\nlibrary(HMMpa) # v. 1.0.1\nlibrary(MASS) # v. 7.3-51.1\n```\n\n## Negative binomial with many zeros\nFirst I’ll draw 200 counts from a negative binomial with a mean (\\\$\\\\lambda\\\$) of \\\$10\\\$ and \\\$\\\\theta = 0.05\\\$.  \n R uses the parameterization of the negative binomial where the variance of the distribution is \\\$\\\\lambda + (\\\\lambda^2\/\\\\theta)\\\$. In this parameterization, as \\\$\\\\theta\\\$ gets small the variance gets big. 
Using a very small value of theta like I am will generally mean the distribution of counts will have many zeros as well as a few large counts\n\nI pull a random sample of size 200 from this distribution using `rnbinom()`. The `mu` argument is the mean and the `size` argument is theta.\n```\nset.seed(16)\ndat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )\n```\nBelow is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution.\n```\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Negative binomial\",\n         subtitle = \"mean = 10, theta = 0.05\" ) +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 150, y = 100, size = 8)\n```\n![](https:\/\/aosmith.rbind.io\/post\/2019-03-06-lots-of-zeros_files\/figure-html\/unnamed-chunk-3-1.png)\n\n## Generalized Poisson with many zeros\nI don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷\n\nFrom my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. This would mean that it can have more extreme maximum counts as well as lots of zeros.\n\nSee the documentation for `rgenpois()` for the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when `lambda2` is 0, the generalized Poisson reduces to the Poisson.\n```\nset.seed(16)\ndat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )\n```\nBelow is a histogram of these data. 
Just over 50% of the values are zeros but the maximum count is over 1000! 💥\n```\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Generalized Poisson\",\n         subtitle = \"lambda1 = 0.5, lambda2 = 0.95\") +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 600, y = 100, size = 8)\n```\n![](https:\/\/aosmith.rbind.io\/post\/2019-03-06-lots-of-zeros_files\/figure-html\/unnamed-chunk-5-1.png)\n\n## Lots of zeros or excess zeros?\nAll the simulations above show us is that some distributions *can* have a lot of zeros. In any given scenario, though, how do we check if we have *excess* zeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros than we may either need a different distribution to model the data or we could think about models that specifically address zero inflation.\n\nThe key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using.\n\n## Simulate negative binomial data\nI’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. I’ll use a model fit to these data to show how to check for excess zeros.\n\nSince this is a generalized linear model, I first calculate the means based on the linear predictor. 
The exponentiation is due to using the natural log link to *link* the mean to the linear predictor.\n```\nset.seed(16)\nx = runif(200, 5, 10) # simulate explanatory variable\nb0 = 1 # set value of intercept\nb1 = 0.25 # set value of slope\nmeans = exp(b0 + b1*x) # calculate true means\ntheta = 0.25 # true theta\n```\nI can use these true means along with my chosen value of `theta` to simulate data from the negative binomial distribution.\n```\ny = rnbinom(200, mu = means, size = theta)\n```\nNow that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with `x` as the explanatory variable, which is how I created the data, this model should work well. The `glm.nb()` function is from package **MASS**.\n```\nfit1 = glm.nb(y ~ x)\n```\nIn this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data.\n\n## Checking for excess zeros\nThe observed data has 76 zeros (out of 200).\n```\nsum(y == 0)\n```\n```\n# [1] 76\n```\nHow many zeros is expected given the model? I need the model estimated means and theta to answer this question. I can get the means via `predict()` and I can pull `theta` out of the model `summary()`.\n```\npreds = predict(fit1, type = \"response\") # estimated means\nesttheta = summary(fit1)$theta # estimated theta\n```\nFor discrete distributions like the negative binomial, the *density* distribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use `dnbinom()` to calculate the probability of an observation being 0 for every row in the dataset. 
To do this I need to provide values for the parameters of the distribution of each observation.\n\nBased on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta.\n```\nprop0 = dnbinom(x = 0, mu = preds, size = esttheta )\n```\nThe sum of these probabilities is an estimate of the number of zero values expected by the model (see [here](https:\/\/data.library.virginia.edu\/getting-started-with-hurdle-models\/) for another example). I’ll round this to the nearest integer.\n```\nround( sum(prop0) )\n```\n```\n# [1] 72\n```\nThe expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data.\n\n## An example with excess zeros\nThe example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data.\n```\nfit2 = glm(y ~ x, family = poisson)\n```\nRemember the data contain 76 zeros.\n```\nsum(y == 0)\n```\n```\n# [1] 76\n```\nUsing `dpois()`, the number of zeros given be the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data.\n```\nround( sum( dpois(x = 0,\n           lambda = predict(fit2, type = \"response\") ) ) )\n```\n```\n# [1] 0\n```\nThis brings me back to my earlier point about checking model fit. If I had done other standard checks of model fit for `fit2` I would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion).\n\n## Just the code, please\nHere’s the code without all the discussion. 
Copy and paste the code below or you can download an R script of uncommented code [from here](https:\/\/aosmith.rbind.io\/script\/2019-03-06-lots-of-zeros.R).\n```\nlibrary(ggplot2) # v. 3.1.0\nlibrary(HMMpa) # v. 1.0.1\nlibrary(MASS) # v. 7.3-51.1\n\nset.seed(16)\ndat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )\n\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Negative binomial\",\n         subtitle = \"mean = 10, theta = 0.05\" ) +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 150, y = 100, size = 8)\n\nset.seed(16)\ndat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )\n\nggplot(dat, aes(x = Y) ) +\n    geom_histogram(binwidth = 5)  +\n    theme_bw(base_size = 18) +\n    labs(y = \"Frequency\",\n         title = \"Generalized Poisson\",\n         subtitle = \"lambda1 = 0.5, lambda2 = 0.95\") +\n    annotate(geom = \"text\",\n            label = paste(\"Proportion 0:\", mean(dat$Y == 0), \n                        \"\\nMax Count:\", max(dat$Y) ),\n                        x = 600, y = 100, size = 8)\n\nset.seed(16)\nx = runif(200, 5, 10) # simulate explanatory variable\nb0 = 1 # set value of intercept\nb1 = 0.25 # set value of slope\nmeans = exp(b0 + b1*x) # calculate true means\ntheta = 0.25 # true theta\ny = rnbinom(200, mu = means, size = theta)\n\nfit1 = glm.nb(y ~ x)\n\nsum(y == 0)\n\npreds = predict(fit1, type = \"response\") # estimated means\nesttheta = summary(fit1)$theta # estimated theta\n\nprop0 = dnbinom(x = 0, mu = preds, size = esttheta )\nround( sum(prop0) )\n\nfit2 = glm(y ~ x, family = poisson)\nsum(y == 0)\n\nround( sum( dpois(x = 0,\n           lambda = predict(fit2, type = \"response\") ) ) 
)\n```","meta_canonical":null,"ml_categories_json":"{\"\/Science\":736,\"\/Science\/Mathematics\":690,\"\/Science\/Mathematics\/Statistics\":678,\"\/Computers_and_Electronics\":137,\"\/Computers_and_Electronics\/Software\":100}","ml_types_json":"{\"\/Article\":996,\"\/Article\/Tutorial_or_Guide\":958}","ml_intent_types_json":"{\"Informational\":999}","meta_language":null,"attrs_author":"Ariel Muldoon","attrs_publish_time":1551830400,"attrs_original_publish_time":1551830400,"attrs_is_republished":0,"attrs_nr_words":"1894","attrs_boilerpipe_nr_words":"1682","body_ext_links_number":11,"body_int_links_number":16,"meta_nofollow":0,"meta_noarchive":0,"props_was_rendered":0,"src_redirect":"","download_time_msec":844,"download_ttfb_msec":843,"download_size":7412}

3. Robots.txt Check

Query:

Response:

4. Spam/Ban Check

Query:

Response:

5. Seen Status Check

ℹ️ Skipped - page is already crawled

📄 INDEXABLE

✅ CRAWLED (13 hours ago)

🤖 ROBOTS ALLOWED

Page Info Filters

Filter	Status	Condition	Details
HTTP status	PASS	`download_http_code = 200`	HTTP 200
Age cutoff	PASS	`download_stamp > now() - 6 MONTH`	0 months ago
History drop	PASS	`isNull(history_drop_reason)`	No drop reason
Spam/ban	PASS	`fh_dont_index != 1 AND ml_spam_score = 0`	ml_spam_score=0
Canonical	PASS	`meta_canonical IS NULL OR = '' OR = src_unparsed`	Not set

Page Details

Property	Value
URL	https://aosmith.rbind.io/2019/03/06/lots-of-zeros/
Last Crawled	2026-04-28 01:18:37 (13 hours ago)
First Indexed	2019-03-06 20:37:57 (7 years ago)
HTTP Status Code	200

Content

Meta Title	Lots of zeros or too many zeros?: Thinking about zero inflation in count data
Meta Description	When working with counts, having many zeros does not necessarily indicate zero inflation. I demonstrate this by simulating data from the negative binomial and generalized Poisson distributions. I then show one way to check if the data has excess zeros compared to the number of zeros expected based on the model.
Meta Canonical	null

Boilerpipe Text

In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation. Zero inflation is when there are more 0 values in the data than the distribution allows for. But some distributions can have a lot of zeros! Table of Contents Load packages and dataset Negative binomial with many zeros Generalized Poisson with many zeros Lots of zeros or excess zeros? Simulate negative binomial data Checking for excess zeros An example with excess zeros Just the code, please Load packages and dataset I’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today. Package HMMpa is for a function to draw random samples from the generalized Poisson distribution. library(ggplot2) # v. 3.1.0 library(HMMpa) # v. 1.0.1 library(MASS) # v. 7.3-51.1 Negative binomial with many zeros First I’ll draw 200 counts from a negative binomial with a mean ($\lambda$) of $10$ and $\theta = 0.05$. R uses the parameterization of the negative binomial where the variance of the distribution is $\lambda + (\lambda^2/\theta)$. In this parameterization, as $\theta$ gets small the variance gets big. Using a very small value of theta, as I am here, generally means the distribution of counts will have many zeros as well as a few large counts. I pull a random sample of size 200 from this distribution using rnbinom(). The mu argument is the mean and the size argument is theta. set.seed(16) dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) ) Below is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution. 
ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Negative binomial", subtitle = "mean = 10, theta = 0.05" ) + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 150, y = 100, size = 8) Generalized Poisson with many zeros I don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷 From my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. This would mean that it can have more extreme maximum counts as well as lots of zeros. See the documentation for rgenpois() for the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when lambda2 is 0, the generalized Poisson reduces to the Poisson. set.seed(16) dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) ) Below is a histogram of these data. Just over 50% of the values are zeros but the maximum count is over 1000! 💥 ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Generalized Poisson", subtitle = "lambda1 = 0.5, lambda2 = 0.95") + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 600, y = 100, size = 8) Lots of zeros or excess zeros? All the simulations above show us is that some distributions can have a lot of zeros. In any given scenario, though, how do we check if we have excess zeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros then we may either need a different distribution to model the data or we could think about models that specifically address zero inflation. 
The key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using. Simulate negative binomial data I’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. I’ll use a model fit to these data to show how to check for excess zeros. Since this is a generalized linear model, I first calculate the means based on the linear predictor. The exponentiation is due to using the natural log link to link the mean to the linear predictor. set.seed(16) x = runif(200, 5, 10) # simulate explanatory variable b0 = 1 # set value of intercept b1 = 0.25 # set value of slope means = exp(b0 + b1*x) # calculate true means theta = 0.25 # true theta I can use these true means along with my chosen value of theta to simulate data from the negative binomial distribution. y = rnbinom(200, mu = means, size = theta) Now that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with x as the explanatory variable, which is how I created the data, this model should work well. The glm.nb() function is from package MASS. fit1 = glm.nb(y ~ x) In this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data. Checking for excess zeros The observed data has 76 zeros (out of 200). sum(y == 0) # [1] 76 How many zeros are expected given the model? I need the model estimated means and theta to answer this question. 
I can get the means via predict() and I can pull theta out of the model summary(). preds = predict(fit1, type = "response") # estimated means esttheta = summary(fit1)$theta # estimated theta For discrete distributions like the negative binomial, the density distribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use dnbinom() to calculate the probability of an observation being 0 for every row in the dataset. To do this I need to provide values for the parameters of the distribution of each observation. Based on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta. prop0 = dnbinom(x = 0, mu = preds, size = esttheta ) The sum of these probabilities is an estimate of the number of zero values expected by the model (see here for another example). I’ll round this to the nearest integer. round( sum(prop0) ) # [1] 72 The expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data. An example with excess zeros The example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data. fit2 = glm(y ~ x, family = poisson) Remember the data contain 76 zeros. sum(y == 0) # [1] 76 Using dpois(), the number of zeros given by the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data. round( sum( dpois(x = 0, lambda = predict(fit2, type = "response") ) ) ) # [1] 0 This brings me back to my earlier point about checking model fit. 
If I had done other standard checks of model fit for fit2 I would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion). Just the code, please Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code from here . library(ggplot2) # v. 3.1.0 library(HMMpa) # v. 1.0.1 library(MASS) # v. 7.3-51.1 set.seed(16) dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) ) ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Negative binomial", subtitle = "mean = 10, theta = 0.05" ) + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 150, y = 100, size = 8) set.seed(16) dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) ) ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Generalized Poisson", subtitle = "lambda1 = 0.5, lambda2 = 0.95") + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 600, y = 100, size = 8) set.seed(16) x = runif(200, 5, 10) # simulate explanatory variable b0 = 1 # set value of intercept b1 = 0.25 # set value of slope means = exp(b0 + b1*x) # calculate true means theta = 0.25 # true theta y = rnbinom(200, mu = means, size = theta) fit1 = glm.nb(y ~ x) sum(y == 0) preds = predict(fit1, type = "response") # estimated means esttheta = summary(fit1)$theta # estimated theta prop0 = dnbinom(x = 0, mu = preds, size = esttheta ) round( sum(prop0) ) fit2 = glm(y ~ x, family = poisson) sum(y == 0) round( sum( dpois(x = 0, lambda = predict(fit2, type = "response") ) ) )

Markdown

[Very statisticious](https://aosmith.rbind.io/) [![](https://aosmith.rbind.io/img/main/ao_ppg2.jpg)](https://aosmith.rbind.io/#about) #### Ariel Muldoon ##### I currently work as an applied statistician in aviation and aeronautics. In a previous role as a consulting statistician in academia I created and taught R workshops for applied science graduate students who are just getting started in R, where my goal was to make their transition to a programming language as smooth as possible. See these workshop materials at [my website](http://ariel.rbind.io/). # Lots of zeros or too many zeros?: Thinking about zero inflation in count data March 6, 2019 · [@aosmith16](https://twitter.com/aosmith16) · [View source](https://github.com/aosmith16/aosmith/blob/master/content/post/2019-03-06-lots-of-zeros.Rmd) [glmm](https://aosmith.rbind.io/tags/glmm), [simulation](https://aosmith.rbind.io/tags/simulation), [teaching](https://aosmith.rbind.io/tags/teaching) In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation. Zero inflation is when there are more 0 values in the data than the distribution allows for. But some distributions can have a lot of zeros! 
## Table of Contents - [Load packages and dataset](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#load-packages-and-dataset) - [Negative binomial with many zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#negative-binomial-with-many-zeros) - [Generalized Poisson with many zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#generalized-poisson-with-many-zeros) - [Lots of zeros or excess zeros?](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#lots-of-zeros-or-excess-zeros) - [Simulate negative binomial data](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#simulate-negative-binomial-data) - [Checking for excess zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#checking-for-excess-zeros) - [An example with excess zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#an-example-with-excess-zeros) - [Just the code, please](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#just-the-code-please) # Load packages and dataset I’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today. Package **HMMpa** is for a function to draw random samples from the generalized Poisson distribution. ``` library(ggplot2) # v. 3.1.0 library(HMMpa) # v. 1.0.1 library(MASS) # v. 7.3-51.1 ``` # Negative binomial with many zeros First I’ll draw 200 counts from a negative binomial with a mean ($\lambda$) of $10$ and $\theta = 0.05$. R uses the parameterization of the negative binomial where the variance of the distribution is $\lambda + (\lambda^2/\theta)$. In this parameterization, as $\theta$ gets small the variance gets big. Using a very small value of theta, as I am here, generally means the distribution of counts will have many zeros as well as a few large counts. I pull a random sample of size 200 from this distribution using `rnbinom()`. The `mu` argument is the mean and the `size` argument is theta. 
``` set.seed(16) dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) ) ``` Below is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution. ``` ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Negative binomial", subtitle = "mean = 10, theta = 0.05" ) + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 150, y = 100, size = 8) ``` ![](https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-3-1.png) # Generalized Poisson with many zeros I don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷 From my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. This would mean that it can have more extreme maximum counts as well as lots of zeros. See the documentation for `rgenpois()` for the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when `lambda2` is 0, the generalized Poisson reduces to the Poisson. ``` set.seed(16) dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) ) ``` Below is a histogram of these data. Just over 50% of the values are zeros but the maximum count is over 1000! 
💥 ``` ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Generalized Poisson", subtitle = "lambda1 = 0.5, lambda2 = 0.95") + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 600, y = 100, size = 8) ``` ![](https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-5-1.png) # Lots of zeros or excess zeros? All the simulations above show us is that some distributions *can* have a lot of zeros. In any given scenario, though, how do we check if we have *excess* zeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros then we may either need a different distribution to model the data or we could think about models that specifically address zero inflation. The key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using. # Simulate negative binomial data I’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. I’ll use a model fit to these data to show how to check for excess zeros. Since this is a generalized linear model, I first calculate the means based on the linear predictor. The exponentiation is due to using the natural log link to *link* the mean to the linear predictor. 
``` set.seed(16) x = runif(200, 5, 10) # simulate explanatory variable b0 = 1 # set value of intercept b1 = 0.25 # set value of slope means = exp(b0 + b1*x) # calculate true means theta = 0.25 # true theta ``` I can use these true means along with my chosen value of `theta` to simulate data from the negative binomial distribution. ``` y = rnbinom(200, mu = means, size = theta) ``` Now that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with `x` as the explanatory variable, which is how I created the data, this model should work well. The `glm.nb()` function is from package **MASS**. ``` fit1 = glm.nb(y ~ x) ``` In this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data. # Checking for excess zeros The observed data has 76 zeros (out of 200). ``` sum(y == 0) ``` ``` # [1] 76 ``` How many zeros are expected given the model? I need the model estimated means and theta to answer this question. I can get the means via `predict()` and I can pull `theta` out of the model `summary()`. ``` preds = predict(fit1, type = "response") # estimated means esttheta = summary(fit1)$theta # estimated theta ``` For discrete distributions like the negative binomial, the *density* distribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use `dnbinom()` to calculate the probability of an observation being 0 for every row in the dataset. To do this I need to provide values for the parameters of the distribution of each observation. Based on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta. 
``` prop0 = dnbinom(x = 0, mu = preds, size = esttheta ) ``` The sum of these probabilities is an estimate of the number of zero values expected by the model (see [here](https://data.library.virginia.edu/getting-started-with-hurdle-models/) for another example). I’ll round this to the nearest integer. ``` round( sum(prop0) ) ``` ``` # [1] 72 ``` The expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data. # An example with excess zeros The example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data. ``` fit2 = glm(y ~ x, family = poisson) ``` Remember the data contain 76 zeros. ``` sum(y == 0) ``` ``` # [1] 76 ``` Using `dpois()`, the number of zeros given by the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data. ``` round( sum( dpois(x = 0, lambda = predict(fit2, type = "response") ) ) ) ``` ``` # [1] 0 ``` This brings me back to my earlier point about checking model fit. If I had done other standard checks of model fit for `fit2` I would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion). # Just the code, please Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](https://aosmith.rbind.io/script/2019-03-06-lots-of-zeros.R). ``` library(ggplot2) # v. 3.1.0 library(HMMpa) # v. 1.0.1 library(MASS) # v. 
7.3-51.1 set.seed(16) dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) ) ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Negative binomial", subtitle = "mean = 10, theta = 0.05" ) + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 150, y = 100, size = 8) set.seed(16) dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) ) ggplot(dat, aes(x = Y) ) + geom_histogram(binwidth = 5) + theme_bw(base_size = 18) + labs(y = "Frequency", title = "Generalized Poisson", subtitle = "lambda1 = 0.5, lambda2 = 0.95") + annotate(geom = "text", label = paste("Proportion 0:", mean(dat$Y == 0), "\nMax Count:", max(dat$Y) ), x = 600, y = 100, size = 8) set.seed(16) x = runif(200, 5, 10) # simulate explanatory variable b0 = 1 # set value of intercept b1 = 0.25 # set value of slope means = exp(b0 + b1*x) # calculate true means theta = 0.25 # true theta y = rnbinom(200, mu = means, size = theta) fit1 = glm.nb(y ~ x) sum(y == 0) preds = predict(fit1, type = "response") # estimated means esttheta = summary(fit1)$theta # estimated theta prop0 = dnbinom(x = 0, mu = preds, size = esttheta ) round( sum(prop0) ) fit2 = glm(y ~ x, family = poisson) sum(y == 0) round( sum( dpois(x = 0, lambda = predict(fit2, type = "response") ) ) ) ``` © 2024 Ariel Muldoon [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/)

Readable Markdown

In a recent lecture I gave a basic overview of zero-inflation in count distributions. My main take-home message to the students that I thought worth posting about here is that having a lot of zero values does not necessarily mean you have zero inflation. Zero inflation is when there are more 0 values in the data than the distribution allows for. But some distributions can have a lot of zeros! ## Table of Contents - [Load packages and dataset](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#load-packages-and-dataset) - [Negative binomial with many zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#negative-binomial-with-many-zeros) - [Generalized Poisson with many zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#generalized-poisson-with-many-zeros) - [Lots of zeros or excess zeros?](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#lots-of-zeros-or-excess-zeros) - [Simulate negative binomial data](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#simulate-negative-binomial-data) - [Checking for excess zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#checking-for-excess-zeros) - [An example with excess zeros](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#an-example-with-excess-zeros) - [Just the code, please](https://aosmith.rbind.io/2019/03/06/lots-of-zeros/#just-the-code-please) ## Load packages and dataset I’m going to be simulating counts from different distributions to demonstrate this. First I’ll load the packages I’m using today. Package **HMMpa** is for a function to draw random samples from the generalized Poisson distribution. ``` library(ggplot2) # v. 3.1.0 library(HMMpa) # v. 1.0.1 library(MASS) # v. 7.3-51.1 ``` ## Negative binomial with many zeros First I’ll draw 200 counts from a negative binomial with a mean ($\lambda$) of $10$ and $\theta = 0.05$. R uses the parameterization of the negative binomial where the variance of the distribution is $\lambda + (\lambda^2/\theta)$. 
In this parameterization, as $\theta$ gets small the variance gets big. Using a very small value of theta like I am will generally mean the distribution of counts will have many zeros as well as a few large counts.

I pull a random sample of size 200 from this distribution using `rnbinom()`. The `mu` argument is the mean and the `size` argument is theta.

```
set.seed(16)
dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )
```

Below is a histogram of these data. I’ve annotated the plot with the proportion of the 200 values that are 0 as well as the maximum observed count in the dataset. There are lots of zeros! But these data are not zero-inflated because we expect to have many 0 values under this particular distribution.

```
ggplot(dat, aes(x = Y) ) +
     geom_histogram(binwidth = 5) +
     theme_bw(base_size = 18) +
     labs(y = "Frequency",
          title = "Negative binomial",
          subtitle = "mean = 10, theta = 0.05" ) +
     annotate(geom = "text",
              label = paste("Proportion 0:", mean(dat$Y == 0),
                            "\nMax Count:", max(dat$Y) ),
              x = 150, y = 100, size = 8)
```

![](https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-3-1.png)

## Generalized Poisson with many zeros

I don’t know the generalized Poisson distribution well, although it appears to be regularly used in some fields. For whatever reason, the negative binomial seems much more common in ecology. 🤷

From my understanding, the generalized Poisson distribution can have heavier tails than the negative binomial. This would mean that it can have more extreme maximum counts as well as lots of zeros.

See the documentation for `rgenpois()` for the formula for the density of the generalized Poisson and definitions of mean and variance. Note that when `lambda2` is 0, the generalized Poisson reduces to the Poisson.

```
set.seed(16)
dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )
```

Below is a histogram of these data. Just over 50% of the values are zeros but the maximum count is over 1000!
💥

```
ggplot(dat, aes(x = Y) ) +
     geom_histogram(binwidth = 5) +
     theme_bw(base_size = 18) +
     labs(y = "Frequency",
          title = "Generalized Poisson",
          subtitle = "lambda1 = 0.5, lambda2 = 0.95") +
     annotate(geom = "text",
              label = paste("Proportion 0:", mean(dat$Y == 0),
                            "\nMax Count:", max(dat$Y) ),
              x = 600, y = 100, size = 8)
```

![](https://aosmith.rbind.io/post/2019-03-06-lots-of-zeros_files/figure-html/unnamed-chunk-5-1.png)

## Lots of zeros or excess zeros?

All the simulations above show us is that some distributions *can* have a lot of zeros. In any given scenario, though, how do we check if we have *excess* zeros? Having excess zeros means there are more zeros than expected by the distribution we are using for modeling. If we have excess zeros then we may either need a different distribution to model the data or we could think about models that specifically address zero inflation.

The key to checking for excess zeros is to estimate the number of zeros you would expect to see if the fitted model were truly the model that created your data and compare that to the number of zeros in the actual data. If there are many more zeros in the data than the model allows for then you have zero inflation compared to whatever distribution you are using.

## Simulate negative binomial data

I’ll now simulate data based on a negative binomial model with a single, continuous explanatory variable. I’ll use a model fit to these data to show how to check for excess zeros.

Since this is a generalized linear model, I first calculate the means based on the linear predictor. The exponentiation is due to using the natural log link to *link* the mean to the linear predictor.
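Written out in symbols, the model being simulated looks like the following. This is a restatement rather than anything from the original code; $\beta_0$, $\beta_1$, and $\mu_i$ are notation assumed here for the intercept, the slope, and the mean of observation $i$.

$$
\log(\mu_i) = \beta_0 + \beta_1 x_i
\quad\Longrightarrow\quad
\mu_i = \exp(\beta_0 + \beta_1 x_i),
\qquad
Y_i \sim \text{NegBin}(\mu_i, \theta)
$$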
```
set.seed(16)
x = runif(200, 5, 10) # simulate explanatory variable
b0 = 1 # set value of intercept
b1 = 0.25 # set value of slope
means = exp(b0 + b1*x) # calculate true means
theta = 0.25 # true theta
```

I can use these true means along with my chosen value of `theta` to simulate data from the negative binomial distribution.

```
y = rnbinom(200, mu = means, size = theta)
```

Now that I’ve made some data I can fit a model. Since I’m using a negative binomial GLM with `x` as the explanatory variable, which is how I created the data, this model should work well. The `glm.nb()` function is from package **MASS**.

```
fit1 = glm.nb(y ~ x)
```

In this exercise I’m going to go directly to checking for excess zeros. This means I’m skipping other important checks of model fit, such as checks for overdispersion and examining residual plots. Don’t skip these in a real analysis; having excess zeros certainly isn’t the only problem we can run into with count data.

## Checking for excess zeros

The observed data have 76 zeros (out of 200).

```
sum(y == 0)
```

```
# [1] 76
```

How many zeros are expected given the model? I need the model estimated means and theta to answer this question. I can get the means via `predict()` and I can pull `theta` out of the model `summary()`.

```
preds = predict(fit1, type = "response") # estimated means
esttheta = summary(fit1)$theta # estimated theta
```

For discrete distributions like the negative binomial, the *density* distribution functions (which start with the letter “d”) return the probability that the observation is equal to a given value. This means I can use `dnbinom()` to calculate the probability of an observation being 0 for every row in the dataset.

To do this I need to provide values for the parameters of the distribution of each observation. Based on the model, the distribution of each observation is negative binomial with the mean estimated from the model and the overall estimated theta.
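For reference, the probability of a zero has a closed form in this parameterization of the negative binomial. The hats mark estimated quantities and are notation assumed here, not symbols from the original post.

$$
P(Y_i = 0) = \left( \frac{\hat{\theta}}{\hat{\theta} + \hat{\mu}_i} \right)^{\hat{\theta}}
$$

Observations with larger estimated means therefore contribute smaller zero probabilities.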
```
prop0 = dnbinom(x = 0, mu = preds, size = esttheta )
```

The sum of these probabilities is an estimate of the number of zero values expected by the model (see [here](https://data.library.virginia.edu/getting-started-with-hurdle-models/) for another example). I’ll round this to the nearest integer.

```
round( sum(prop0) )
```

```
# [1] 72
```

The expected number of 0 values is ~72, very close to the 76 observed in the data. This is no big surprise, since I fit the same model that I used to create the data.

## An example with excess zeros

The example above demonstrates a model without excess zeros. Let me finish by fitting a model to data that has more zeros than expected by the distribution. This can be done by fitting a Poisson GLM instead of a negative binomial GLM to my simulated data.

```
fit2 = glm(y ~ x, family = poisson)
```

Remember the data contain 76 zeros.

```
sum(y == 0)
```

```
# [1] 76
```

Using `dpois()`, the number of zeros given by the Poisson model is 0. 😮 These data are zero-inflated compared to the Poisson distribution, and I clearly need a different approach for modeling these data.

```
round( sum( dpois(x = 0, lambda = predict(fit2, type = "response") ) ) )
```

```
# [1] 0
```

This brings me back to my earlier point about checking model fit. If I had done other standard checks of model fit for `fit2` I would have seen additional problems that would indicate the Poisson distribution did not fit these data (such as severe overdispersion).

## Just the code, please

Here’s the code without all the discussion. Copy and paste the code below or you can download an R script of uncommented code [from here](https://aosmith.rbind.io/script/2019-03-06-lots-of-zeros.R).

```
library(ggplot2) # v. 3.1.0
library(HMMpa) # v. 1.0.1
library(MASS) # v. 7.3-51.1

set.seed(16)
dat = data.frame(Y = rnbinom(200, mu = 10, size = .05) )

ggplot(dat, aes(x = Y) ) +
     geom_histogram(binwidth = 5) +
     theme_bw(base_size = 18) +
     labs(y = "Frequency",
          title = "Negative binomial",
          subtitle = "mean = 10, theta = 0.05" ) +
     annotate(geom = "text",
              label = paste("Proportion 0:", mean(dat$Y == 0),
                            "\nMax Count:", max(dat$Y) ),
              x = 150, y = 100, size = 8)

set.seed(16)
dat = data.frame(Y = rgenpois(200, lambda1 = 0.5, lambda2 = 0.95) )

ggplot(dat, aes(x = Y) ) +
     geom_histogram(binwidth = 5) +
     theme_bw(base_size = 18) +
     labs(y = "Frequency",
          title = "Generalized Poisson",
          subtitle = "lambda1 = 0.5, lambda2 = 0.95") +
     annotate(geom = "text",
              label = paste("Proportion 0:", mean(dat$Y == 0),
                            "\nMax Count:", max(dat$Y) ),
              x = 600, y = 100, size = 8)

set.seed(16)
x = runif(200, 5, 10) # simulate explanatory variable
b0 = 1 # set value of intercept
b1 = 0.25 # set value of slope
means = exp(b0 + b1*x) # calculate true means
theta = 0.25 # true theta

y = rnbinom(200, mu = means, size = theta)

fit1 = glm.nb(y ~ x)

sum(y == 0)

preds = predict(fit1, type = "response") # estimated means
esttheta = summary(fit1)$theta # estimated theta

prop0 = dnbinom(x = 0, mu = preds, size = esttheta )
round( sum(prop0) )

fit2 = glm(y ~ x, family = poisson)

sum(y == 0)

round( sum( dpois(x = 0, lambda = predict(fit2, type = "response") ) ) )
```
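As a complement to the analytical check with `dpois()`, the expected number of zeros can also be approximated by simulating new responses from the fitted model and counting the zeros in each simulated dataset. The sketch below is a simulation-based variant, not the method used in the post. It assumes base R only and re-creates `x` and `y` with the same seed and parameter values; the names `obs_zeros`, `sims`, and `sim_zeros` are hypothetical.

```r
# Re-create the simulated negative binomial data from the post
set.seed(16)
x = runif(200, 5, 10)
means = exp(1 + 0.25*x)
y = rnbinom(200, mu = means, size = 0.25)

# Fit the Poisson GLM, which is the wrong distribution for these data
fit2 = glm(y ~ x, family = poisson)

# Observed number of zeros
obs_zeros = sum(y == 0)

# Simulate 1000 response vectors from the fitted Poisson model
sims = simulate(fit2, nsim = 1000, seed = 1)

# Number of zeros in each simulated dataset
sim_zeros = colSums(sims == 0)

# Compare the observed count to the simulated distribution
obs_zeros
summary(sim_zeros)
mean(sim_zeros >= obs_zeros) # share of simulations with at least as many zeros
```

If the observed count sits far above anything the simulations produce, the data are zero-inflated relative to the fitted distribution.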

ML Classification

ML Categories

/Science		73.6%
/Science/Mathematics		69.0%
/Science/Mathematics/Statistics		67.8%
/Computers_and_Electronics		13.7%
/Computers_and_Electronics/Software		10.0%

Raw JSON

{
    "/Science": 736,
    "/Science/Mathematics": 690,
    "/Science/Mathematics/Statistics": 678,
    "/Computers_and_Electronics": 137,
    "/Computers_and_Electronics/Software": 100
}

ML Page Types

/Article		99.6%
/Article/Tutorial_or_Guide		95.8%

Raw JSON

{
    "/Article": 996,
    "/Article/Tutorial_or_Guide": 958
}

ML Intent Types

Informational

99.9%

Raw JSON

{
    "Informational": 999
}

Content Metadata

Language

null

Author

Ariel Muldoon

Publish Time

2019-03-06 00:00:00 (7 years ago)

Original Publish Time

2019-03-06 00:00:00 (7 years ago)

Republished

Word Count (Total)

1,894

Word Count (Content)

1,682

Links

External Links

Internal Links

Technical SEO

Meta Nofollow

Meta Noarchive

JS Rendered

Redirect Target

null

Performance

Download Time (ms)

844

TTFB (ms)

843

Download Size (bytes)

7,412

Shard

166 (laksa)

Root Hash

13779606198685285766

Unparsed URL

io,rbind!aosmith,/2019/03/06/lots-of-zeros/ s443