{ "cells": [ { "cell_type": "markdown", "id": "5ddad2d4-6b8e-417b-b2f1-818ec162c523", "metadata": {}, "source": [ "## The Bootstrap\n", "\"Bootstrapping\" refers to computational techniques for making inferences about a statistic beyond point estimates by treating the samples as though they were the populations of interest. Regarding the origin of the term, [An Introduction to the Bootstrap](https://cindy.informatik.uni-bremen.de/cosy/teaching/CM_2011/Eval3/pe_efron_93.pdf) states:\n", "\n", "> The use of the term bootstrap derives from the phrase *to\n", "pull oneself up by one's bootstrap*, widely thought to be based on\n", "one of the eighteenth century Adventures of Baron Munchausen,\n", "by Rudolph Erich Raspe. (The Baron had fallen to the bottom of\n", "a deep lake. Just when it looked like all was lost, he thought to\n", "pick himself up by his own bootstraps.)\n", "\n", "Let us return to the experiment that we considered at the beginning of the discussion of permutation tests. Again, a new medical treatment is intended to prolong life after a form of surgery. Sixteen mice are randomly assigned to either a treatment group or control group. All mice receive the surgery, but only the treatment group will receive the new treatment. The survival time of each mouse after surgery is recorded below.\n" ] }, { "cell_type": "code", "execution_count": 1, "id": "7581fc1a-795e-41b1-9de5-4167100d91d5", "metadata": {}, "outputs": [], "source": [ "# survival times measured in days\n", "import numpy as np\n", "x = np.array([94, 197, 16, 38, 99, 141, 23]) # treatment group\n", "y = np.array([52, 104, 146, 10, 51, 30, 40, 27, 46]) # control group" ] }, { "cell_type": "markdown", "id": "ca5fe064-d6b2-493b-8609-4a8d3b7be900", "metadata": {}, "source": [ "The permutation test allowed us to study whether or not the treatment had any effect on the survival times. 
In many studies, we are interested not only in whether there is an effect; we are also interested in the _magnitude_ of the effect. It would be misleading to report only the difference in mean survival times, especially since the permutation test and t-test showed that there was a ~$14\\%$ chance of observing such an extreme difference in means due to chance alone. In addition to reporting our statistic (the difference in means), we should also report some measurement of our uncertainty.\n", "\n", "One way of quantifying our uncertainty is the _standard error_ of our statistic. Suppose we were to perform the same experiment (with new mice) repeatedly. Because the mice are random samples from some greater population and there will be some random error in the effect of the treatment, we would not observe the same value of the statistic every time; rather, the values of the statistic would form a distribution. The standard error is the standard deviation of this distribution.\n", "\n", "How do we calculate the standard error if we do not know the underlying distribution from which the mice survival times are sampled? The typical approach, which we will not discuss in detail, assumes that the underlying distributions are normal; from this assumption and some math, statisticians have derived a formula to estimate the standard error of the statistic. This approach is limited in applicability, however, as it may not produce a good estimate if the original distributions are non-normal; moreover, standard error formulas are only available for a few statistics. Instead, we take a different approach, beginning with the mild assumption that the observed samples are representative of the distributions from which they were taken. We estimate the standard error by repeatedly *resampling from the observed data* (*with replacement*), calculating the statistic of the resample each time, and computing the standard deviation of the resulting distribution. 
This makes sense: to estimate the standard error, we would happily resample from the distribution itself it if were available to us. It's not, so we do the next best thing, which is resampling from the data we already have." ] }, { "cell_type": "code", "execution_count": 2, "id": "e30a2a3c-dae6-4c29-aa9f-d585a4b39213", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Observed Statistic Value: 30.63492063492064\n", "Standard Error: 27.336478035417056\n" ] }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAAY4AAAEGCAYAAABy53LJAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAAsTAAALEwEAmpwYAAAd7ElEQVR4nO3de7gdVZ3m8e9LkJsNHTFBMQmdgBGMDGoMl0ZtL0ibBCTaAwqtD4i0mSjYareXII7d+vTMkxFHGlokRkTAC4g2QkZCA2Iro2MgIUK4awgRDkSJ2nIRmxh854+qA5vNPntXnZw6Zyd5P8+zn1O1aq21fzs52b9Uraq1ZJuIiIiqthvrACIiYsuSxBEREbUkcURERC1JHBERUUsSR0RE1LL9WAcwGiZMmOCpU6eOdRgREVuUG2+88Ve2J7aXbxOJY+rUqaxcuXKsw4iI2KJI+nmn8lyqioiIWpI4IiKiliSOiIioJYkjIiJqSeKIiIhakjgiIqKWJI6IiKgliSMiImpJ4oiIiFq2iSfHI5o0deEVw267btERIxhJxOjIGUdERNSSxBEREbUkcURERC1JHBERUUsSR0RE1JLEERERtSRxRERELUkcERFRS6OJQ9JsSXdJWiNpYYfjknRWeXy1pJktx86T9KCkW4fo+0OSLGlCk58hIiKerrHEIWkccDYwB5gBHCdpRlu1OcD08jUfOKfl2PnA7CH6ngIcDtw7slFHREQvTZ5xHASssb3W9kbgYmBeW515wIUuLAfGS9oTwPZ1wG+G6PsM4COAmwk9IiKG0mTimATc17I/UJbVrfM0ko4C7rd9c4968yWtlLRyw4YN1aOOiIiumkwc6lDWfoZQpc5TlaVdgNOAT/R6c9tLbM+yPWvixIm9qkdEREVNJo4BYErL/mTggWHUabUPMA24WdK6sv4qSc/f7GgjIqKSJhPHCmC6pGmSdgCOBZa21VkKHF/eXXUI8JDt9UN1aPsW23vYnmp7KkXimWn7Fw19hoiIaNNY4rC9CTgFuAq4A7jE9m2SFkhaUFZbBqwF1gBfBN472F7SRcCPgX0lDUg6qalYIyKiukYXcrK9jCI5tJYtbtk2cPIQbY+r0P/UzQwxIiJqypPjERFRSxJHRETUksQRERG1JHFEREQtjQ6OR0R3UxdesVnt1y06YoQiiaguiSNiC7Y5iSdJJ4Yrl6oiIqKWJI6IiKgliSMiImpJ4oiIiFqSOCIiopYkjoiIqCWJIyIiakniiIiIWvIAYASb/wR3xLYkZxwREVFLEkdERNSSxBEREbUkcURERC2NDo5Lmg2cCYwDzrW9qO24yuNzgceAd9peVR47DzgSeND2/i1tTgfeBGwE7gZOtP3bJj9HbBkywB0xOho745A0DjgbmAPMAI6TNKOt2hxgevmaD5zTcux8YHaHrq8B9rd9APBT4NSRjTwiIrpp8lLVQcAa22ttbwQuBua11ZkHXOjC
cmC8pD0BbF8H/Ka9U9tX295U7i4HJjf2CSIi4hmaTByTgPta9gfKsrp1unkXcGWnA5LmS1opaeWGDRtqdBkREd00mTjUoczDqNO5c+k0YBPwtU7HbS+xPcv2rIkTJ1bpMiIiKmhycHwAmNKyPxl4YBh1nkHSCRQD54fZrpRoIiJiZDR5xrECmC5pmqQdgGOBpW11lgLHq3AI8JDt9d06Le/U+ihwlO3Hmgg8IiKG1ljiKAewTwGuAu4ALrF9m6QFkhaU1ZYBa4E1wBeB9w62l3QR8GNgX0kDkk4qD30O2BW4RtJNkhY39RkiIuKZGn2Ow/YyiuTQWra4ZdvAyUO0PW6I8heOZIwREVFPnhyPiIhakjgiIqKWnolD0v696kRExLajyhnHYkk3SHqvpPFNBxQREf2tZ+Kw/Srg7RTPW6yU9HVJhzceWURE9KVKYxy2fwZ8nOL5idcAZ0m6U9JfNRlcRET0nypjHAdIOoPiWYzXA2+y/eJy+4yG44uIiD5T5TmOz1E8nPcx278fLLT9gKSPNxZZRET0pSqJYy7we9tPAEjaDtjJ9mO2v9JodBER0XeqjHF8F9i5ZX+XsiwiIrZBVc44drL96OCO7Ucl7dJgTLGNytKvEVuGKmccv5M0c3BH0iuA33epHxERW7EqZxwfAL4paXCdjD2BtzUWUURE9LWeicP2Ckn7AftSrNh3p+0/NB5ZRET0parTqh8ITC3rv1wSti9sLKqIiOhbPROHpK8A+wA3AU+UxQaSOCIitkFVzjhmATOytndEREC1u6puBZ7fdCAREbFlqHLGMQG4XdINwOODhbaPaiyqiIjoW1USxz8Ot3NJs4EzgXHAubYXtR1XeXwu8BjwTturymPnAUcCD9rev6XN7sA3KAbr1wFvtf0fw40xIiLqqbIexw8ovqCfVW6vAFb1aidpHHA2MAeYARwnaUZbtTnA9PI1Hzin5dj5wOwOXS8ErrU9Hbi23I+IiFFSZVr1dwPfAr5QFk0CLqvQ90HAGttrbW8ELgbmtdWZB1zownJgvKQ9AWxfB/ymQ7/zgAvK7QuAN1eIJSIiRkiVwfGTgVcCD8OTizrtUaHdJOC+lv2BsqxunXbPs72+jGX9ULFImi9ppaSVGzZsqBBuRERUUSVxPF6eMQAgaXuK5zh6UYey9nZV6gyL7SW2Z9meNXHixJHoMiIiqJY4fiDpY8DO5Vrj3wT+T4V2AxTrlA+aDDwwjDrtfjl4Oav8+WCFWCIiYoRUSRwLgQ3ALcB/A5ZRrD/eywpguqRpknYAjgWWttVZChyvwiHAQ4OXobpYCpxQbp8AXF4hloiIGCFVJjn8I8XSsV+s07HtTZJOAa6iuB33PNu3SVpQHl9MkYTmAmsobsc9cbC9pIuA1wITJA0A/2D7S8Ai4BJJJwH3AsfUiSsiIjZPlbmq7qHDuIPtvXu1tb2MIjm0li1u2TbF4HuntscNUf5r4LBe7x0REc2oOlfVoJ0o/oe/ezPhREREv6vyAOCvW1732/5n4PXNhxYREf2oyqWqmS2721GcgezaWEQREdHXqlyq+t8t25so54dqJJqIiOh7Ve6qet1oBBIREVuGKpeq/q7bcdufHblwIiKi31W9q+pAnnp4703AdTx9jqmIiNhGVF3IaabtRwAk/SPwTdt/02RgERHRn6pMObIXsLFlfyPFIkoREbENqnLG8RXgBknfpniC/C3AhY1GFRERfavKXVX/Q9KVwKvLohNt/6TZsCIiol9VuVQFsAvwsO0zgQFJ0xqMKSIi+liVpWP/AfgocGpZ9Czgq00GFRER/avKGcdbgKOA3wHYfoBMORIRsc2qkjg2ltOfG0DSs5sNKSIi+lmVxHGJpC8A4yW9G/guNRd1ioiIrUfXu6okCfgGsB/wMLAv8Anb14xCbBER0Ye6Jg7blnSZ7VcASRbR09SFV4x1CBHRsCqXqpZLOrDxSCIiYotQJXG8jiJ53C1ptaRbJK2u0rmk2ZLukrRG
0sIOxyXprPL46tZFo4ZqK+llkpZLuknSSkkHVYklIiJGxpCXqiTtZfteYM5wOpY0DjgbOBwYAFZIWmr79pZqc4Dp5etg4Bzg4B5tPw180vaVkuaW+68dTowREVFftzOOywBs/xz4rO2ft74q9H0QsMb2WtsbgYuBeW115gEXurCc4s6tPXu0NbBbuf2nwAMVYomIiBHSbXBcLdt7D6PvSTx9zY4BirOKXnUm9Wj7AeAqSZ+hSHyHdnpzSfOB+QB77bXXMMKPiIhOup1xeIjtqtShrL2foep0a/se4IO2pwAfBL7U6c1tL7E9y/asiRMnVgw5IiJ66XbG8VJJD1N8ie9cblPu2/ZuQzcFirOEKS37k3nmZaWh6uzQpe0JwPvL7W8C5/aIIyIiRtCQZxy2x9nezfautrcvtwf3eyUNgBXAdEnTJO0AHMtTy88OWgocX95ddQjwkO31Pdo+ALym3H498LPKnzYiIjZblYWchsX2JkmnAFcB44DzbN8maUF5fDGwDJgLrAEeA07s1rbs+t3AmZK2B/6TchwjIiJGR2OJA8D2Mork0Fq2uGXbwMlV25blPwReMbKRRkREVVUXcoqIiACSOCIioqZuT44/QpfbcCsOkEdExFZmyMRhe1cASZ8CfgF8heJW3LeTFQAjIrZZVS5VvdH2520/Yvth2+cA/7XpwCIioj9VuavqCUlvp5gvysBxwBONRhURjductVPWLTpiBCOJLU2VM46/Bt4K/LJ8HVOWRUTENqjnGYftdTxzVtuIiNhG9TzjkPQiSddKurXcP0DSx5sPLSIi+lGVS1VfBE4F/gBgezXF3FEREbENqpI4drF9Q1vZpiaCiYiI/lclcfxK0j6UDwNKOhpY32hUERHRt6rcjnsysATYT9L9wD0UDwFGRMQ2qEri+LntN0h6NrCd7UeaDioiIvpXlUtV90haAhwCPNpwPBER0eeqJI59ge9SXLK6R9LnJL2q2bAiIqJf9Uwctn9v+xLbfwW8HNgN+EHjkUVERF+qtB6HpNdI+jywCtiJYgqSiIjYBvUcHJd0D3ATcAnwYdu/q9q5pNnAmRTrhp9re1HbcZXH51KsOf5O26t6tZX0PuAUiudJrrD9kaoxRcTmywSJ27auiUPSOODLtj9Vt+Oy7dnA4cAAsELSUtu3t1SbA0wvXwcD5wAHd2sr6XUUc2cdYPtxSXvUjS0iIoava+Kw/UT5RV07cQAHAWtsrwWQdDHFF35r4pgHXGjbwHJJ4yXtCUzt0vY9wCLbj5cxPjiM2KKLzfnfZERs/aqMcfy/8k6qV0uaOfiq0G4ScF/L/kBZVqVOt7YvAl4t6XpJP5B0YKc3lzRf0kpJKzds2FAh3IiIqKLKA4CHlj9bzzoMvL5HO3Uoa1/DfKg63dpuDzyH4rmSA4FLJO1dnrU8VdleQvHEO7NmzRpy7fSIiKinynocrxtm3wPAlJb9ycADFevs0KXtAHBpmShukPRHYAKQ04qIiFFQZT2O50n6kqQry/0Zkk6q0PcKYLqkaZJ2oJiKfWlbnaXA8SocAjxke32PtpdRnu1IehFFkvlVhXgiImIEVBnjOB+4CnhBuf9T4AO9GtneRHHL7FXAHcAltm+TtEDSgrLaMmAtsIZi3Y/3dmtbtjkP2LtcWOpi4IT2y1QREdGcKmMcE2xfIulUKL7UJT1RpXPbyyiSQ2vZ4pZtU0xlUqltWb4ReEeV94+IiJFX5Yzjd5Key1PrcRwCPNRoVBER0beqnHH8HcX4wj6SfgRMBI5uNKqIiOhbVe6qWiXpNRSz5Aq4y/YfGo8sIiL6UpW7qo4Bdi4Hp98MfKPiA4AREbEVqjLG8d9tP1KuwfFG4AKKOaUiImIbVCVxDN5BdQRwju3LKZ6diIiIbVCVxHG/pC9QrMGxTNKOFdtFRMRWqEoCeCvFg3izbf8W2B34cJNBRURE/6qydOxjwDpgTrmA0p62r246sIiI6E9V7qr6BMWA+HMpJhP8sqSP
Nx1YRET0pyoPAB4HvNz2fwJIWkSx9vg/NRlYRET0pypjHOuAnVr2dwTubiSaiIjoe0OecUj6F4r5qR4HbpN0Tbl/OPDD0QkvIiL6TbdLVSvLnzcC324p/35j0URERN8bMnHYvgBA0k7ACynONu4eHOuIiIht05BjHJK2l/RpiqVaLwC+Ctwn6dOSnjVaAUZERH/pNjh+OsXDftNsv8L2y4F9gPHAZ0YhtoiI6EPdEseRwLttPzJYYPth4D3A3KYDi4iI/tQtcbjTWt62n6BcDTAiIrY93RLH7ZKOby+U9A7gziqdS5ot6S5JayQt7HBcks4qj69uXeejQtsPSbKkCVViiYiIkdHtdtyTgUslvYvillwDBwI7A2/p1bGkccDZFM99DAArJC21fXtLtTnA9PJ1MMU6Hwf3aitpSnns3hqfNSIiRkC323Hvp/gSfz3wEoplY6+0fW3Fvg8C1theCyDpYmAe0Jo45gEXlpfElksaL2lPYGqPtmcAHwEurxhLRESMkCprjn8P+N4w+p4E3NeyP0BxVtGrzqRubSUdBdxv+2ZJQ765pPnAfIC99tprGOFHREQnVSY5HK5O3+rtg+pD1elYLmkX4DTgL3u9ue0lwBKAWbNmbXOD+VMXXjHWIUR0tDm/m+sWHTGCkcRwNbmS3wAwpWV/MvBAxTpDle8DTANulrSuLF8l6fkjGnlERAypycSxApguaZqkHYBjgaVtdZYCx5d3Vx0CPGR7/VBtbd9iew/bU21PpUgwM23/osHPERERLRq7VGV7k6RTKJadHQecZ/s2SQvK44uBZRQPE64BHgNO7Na2qVgjIqK6Jsc4sL2MIjm0li1u2TbFbb+V2naoM3Xzo4yIiDqavFQVERFboSSOiIioJYkjIiJqSeKIiIhakjgiIqKWJI6IiKgliSMiImpp9DmO2DyZbyoi+lHOOCIiopYkjoiIqCWJIyIiakniiIiIWpI4IiKiliSOiIioJYkjIiJqSeKIiIhakjgiIqKWJI6IiKgliSMiImppNHFImi3pLklrJC3scFySziqPr5Y0s1dbSadLurOs/21J45v8DBER8XSNJQ5J44CzgTnADOA4STPaqs0Bppev+cA5FdpeA+xv+wDgp8CpTX2GiIh4pibPOA4C1thea3sjcDEwr63OPOBCF5YD4yXt2a2t7attbyrbLwcmN/gZIiKiTZOJYxJwX8v+QFlWpU6VtgDvAq7s9OaS5ktaKWnlhg0baoYeERFDaXI9DnUoc8U6PdtKOg3YBHyt05vbXgIsAZg1a1b7+0bENmhz1rhZt+iIEYxky9Zk4hgAprTsTwYeqFhnh25tJZ0AHAkcZjtJISJiFDV5qWoFMF3SNEk7AMcCS9vqLAWOL++uOgR4yPb6bm0lzQY+Chxl+7EG44+IiA4aO+OwvUnSKcBVwDjgPNu3SVpQHl8MLAPmAmuAx4ATu7Utu/4csCNwjSSA5bYXNPU5IiLi6Rpdc9z2Mork0Fq2uGXbwMlV25blLxzhMCMiooZGE0dExEjanMHtGDmZciQiImpJ4oiIiFqSOCIiopYkjoiIqCWD4w3KQF5EbI1yxhEREbUkcURERC1JHBERUUsSR0RE1JLEERERtSRxRERELbkdNyKigiwC9ZQkjh7yLEZExNPlUlVERNSSxBEREbUkcURERC1JHBERUUsSR0RE1NJo4pA0W9JdktZIWtjhuCSdVR5fLWlmr7aSdpd0jaSflT+f0+RniIiIp2vsdlxJ44CzgcOBAWCFpKW2b2+pNgeYXr4OBs4BDu7RdiFwre1FZUJZCHy0qc8REbG5xvK2/iaeIWnyjOMgYI3ttbY3AhcD89rqzAMudGE5MF7Snj3azgMuKLcvAN7c4GeIiIg2TT4AOAm4r2V/gOKsoledST3aPs/2egDb6yXt0enNJc0H5pe7j0q6azgfoosJwK9GuM/N1Y8xQX/GlZiqSUzV9WNcE/S/NiumP+tU2GTiUIcyV6xTpW1X
tpcAS+q0qUPSStuzmup/OPoxJujPuBJTNYmpun6Mq6mYmrxUNQBMadmfDDxQsU63tr8sL2dR/nxwBGOOiIgemkwcK4DpkqZJ2gE4FljaVmcpcHx5d9UhwEPlZahubZcCJ5TbJwCXN/gZIiKiTWOXqmxvknQKcBUwDjjP9m2SFpTHFwPLgLnAGuAx4MRubcuuFwGXSDoJuBc4pqnP0ENjl8E2Qz/GBP0ZV2KqJjFV149xNRKT7FpDBxERsY3Lk+MREVFLEkdERNSSxDFMkj4kyZImtJSdWk6RcpekN45iLKdLurOctuXbksaPdUzle3edcmaUYpgi6d8l3SHpNknvL8vHfOoaSeMk/UTSd/oopvGSvlX+Pt0h6c/HOi5JHyz/7m6VdJGknUY7JknnSXpQ0q0tZUPGMBr/7oaIaVS+C5I4hkHSFIrpUO5tKZtBcffXS4DZwOfLqVNGwzXA/rYPAH4KnDrWMbVMGzMHmAEcV8Yz2jYBf2/7xcAhwMllHINT10wHri33R9v7gTta9vshpjOBf7O9H/DSMr4xi0vSJOBvgVm296e4WebYMYjpfIp/Q606xjCK/+46xTQq3wVJHMNzBvARnv5Q4jzgYtuP276H4k6xg0YjGNtX295U7i6neO5lTGOi2pQzjbO93vaqcvsRii/CSYzx1DWSJgNHAOe2FI91TLsBfwF8CcD2Rtu/Heu4KO7+3FnS9sAuFM90jWpMtq8DftNWPFQMo/LvrlNMo/VdkMRRk6SjgPtt39x2aKjpU0bbu4Ary+2xjKlf/jyeJGkq8HLgetqmrgE6Tl3ToH+m+M/HH1vKxjqmvYENwJfLS2jnSnr2WMZl+37gMxRn9+spnvW6eixjajFUDP3yu9/Yd0GTU45ssSR9F3h+h0OnAR8D/rJTsw5lI3avc7eYbF9e1jmN4tLM10Yjph7G8r2fQdKfAP8KfMD2w1Kn8EYtliOBB23fKOm1YxbIM20PzATeZ/t6SWcyNpfLnlSOG8wDpgG/Bb4p6R1jGVMFY/673/R3QRJHB7bf0Klc0n+h+AW+ufzimQysknQQ1aZYGfGYWmI7ATgSOMxPPZzTaEw9jOV7P42kZ1Ekja/ZvrQs/qWkPcuJMkd76ppXAkdJmgvsBOwm6atjHBMUf2cDtq8v979FkTjGMq43APfY3gAg6VLg0DGOadBQMYzp7/5ofBfkUlUNtm+xvYftqbanUvxlzLT9C4qpUI6VtKOkaRRrjNwwGnFJmk2xJslRth9rOTRmMVFtypnGqcjwXwLusP3ZlkNjNnWN7VNtTy5/h44Fvmf7HWMZUxnXL4D7JO1bFh0G3D7Gcd0LHCJpl/Lv8jCKcap+mHpoqBi2/u8C23kN8wWsAya07J8G3A3cBcwZxTjWUFy/vKl8LR7rmMr3nktxZ8fdFJfUxuLv6FUUp+SrW/585gLPpbgT5mflz93HKL7XAt8pt8c8JuBlwMryz+sy4DljHRfwSeBO4FbgK8COox0TcBHFGMsfKP7DeFK3GEbj390QMY3Kd0GmHImIiFpyqSoiImpJ4oiIiFqSOCIiopYkjoiIqCWJIyIiakniiC2KpO+3z+wp6QOSPt+jzayG47qonJH0g23lbx7u5I6SXlY+JFi1/nhJ761bT9ILJH1rpOrH1i+JI7Y0F1E8NNfq2LJ8TEh6PnCo7QNsn9F2+M0UswMPx8sonjmpajzQM3G017P9gO2jR7B+bOWSOGJL8y3gSEk7wpMTF74A+KGkcyStVLF2wyc7NZb0aMv20ZLOL7cnSvpXSSvK1ys7tN1J0pcl3VJOAvi68tDVwB6SbpL06pb6hwJHAaeXx/YpX/8m6UZJ/1fSfmXdY1SsN3GzpOvKp+0/BbytbPu2tlheIumG8thqSdOBRcA+Zdnpkv5E0rWSVpUxD85O3F5vqso1HSr221p/nKTPlP2vlvS+yn+TseUa7SdT88prc1/AFcC8cnshcHq5vXv5cxzwfeCAcv/7FOs5ADza
0s/RwPnl9teBV5Xbe1FMUdL+vn8PfLnc3o9iOoydgKnArUPEej5wdMv+tcD0cvtgiulGAG4BJpXb48uf7wQ+N0S//wK8vdzeAdi5PQ6Kueh2K7cnUDxVrA71ntyv2G9r/fdQzAO2fevfQV5b9yuTHMaWaPBy1eXlz3eV5W+VNJ/iC3NPiktEqyv2+QZghp6aNXc3Sbu6WMNj0KsovlixfaeknwMvAh6u8gYqZug9lGKG18HiHcufPwLOl3QJcGmH5u1+DJymYl2PS23/TM+c8VfA/5T0FxTTt08CnjcC/bZ6A8W0FpsAbLevWRFboSSO2BJdBnxW0kxgZ9uryonbPgQcaPs/yktQO3Vo2zrHTuvx7YA/t/37Lu+7uXOxbwf81vbLnhGUvUDSwRSLO90k6Rl12up/XdL1Zf2rJP0NsLat2tuBicArbP9B0jo6/5nU7beVGMPp8mNsZIwjtji2H6W4/HQeTw2K7wb8DnhI0vMolqzt5JeSXixpO+AtLeVXA6cM7gzxxX0dxZcxkl5EcUnrrh7hPgLsWsb9MHCPpGPKPiTppeX2Pravt/0J4FcUU2A/2badpL2BtbbPopj59IAO9f+UYt2PP5TjMX/WHtMw+211NbBAxep8SNq9x59HbAWSOGJLdRHFmtgXA7hYkfEnwG0UCeVHQ7RbCHwH+B7FzKKD/haYVQ7w3g4s6ND288A4SbcA3wDeafvxHnFeDHy4HEzfhyLxnCTp5jLWwQHr08sB5lspEtTNwL9TXD57xuA48DbgVkk3UYy3XGj718CPykH20ykW8ZklaWX5vneWf1bt9er22+pcirGe1eVn+usefx6xFcjsuBERUUvOOCIiopYkjoiIqCWJIyIiakniiIiIWpI4IiKiliSOiIioJYkjIiJq+f8XyCrp0n5dgQAAAABJRU5ErkJggg==\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "import matplotlib.pyplot as plt\n", "\n", "rng = np.random.default_rng()\n", "\n", "def statistic(x, y, axis=0):\n", " return np.mean(x, axis=axis) - np.mean(y, axis=axis)\n", "\n", "def bootstrap_distribution(x, y):\n", " nx, ny = len(x), len(y)\n", " N = 1000\n", " bootstrap_distribution = []\n", " for i in range(N):\n", " # random indices to resample from x and y\n", " ix = rng.integers(0, nx, size=nx)\n", " iy = rng.integers(0, ny, size=ny)\n", " xi = x[ix]\n", " yi = y[iy]\n", " stat = statistic(xi, yi)\n", " bootstrap_distribution.append(stat)\n", " return bootstrap_distribution\n", "\n", "boot_dist = bootstrap_distribution(x, y)\n", "\n", "plt.hist(boot_dist, density=True, bins=20)\n", "plt.xlabel(\"Value of test statistic\")\n", "plt.ylabel(\"Observed Frequency\")\n", "\n", "observed_statistic = statistic(x, y)\n", "standard_error = np.std(boot_dist, ddof=1)\n", "print(f\"Observed Statistic Value: {observed_statistic}\")\n", "print(f\"Standard Error: {standard_error}\")" ] }, { "cell_type": "markdown", "id": "b7802b49-0a53-4f5f-95b1-a0710ab98a81", "metadata": {}, "source": [ "This is precisely what `bootstrap` does." 
] }, { "cell_type": "code", "execution_count": 3, "id": "530d48d9-8944-43e3-848b-37831e3630f7", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "26.78533925848876" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "from scipy import stats\n", "# `n_resamples=1000` indicates that the statistic will be calculated for\n", "# each of 1000 resamples.\n", "# The meaning of `method='percentile'` will be discussed below\n", "res = stats.bootstrap((x, y), statistic, n_resamples=1000, method='percentile')\n", "assert res.standard_error == np.std(res.bootstrap_distribution, ddof=1)\n", "res.standard_error" ] }, { "cell_type": "markdown", "id": "bce2cf1c-4a82-4438-afb7-1832690132f2", "metadata": {}, "source": [ "The two standard error estimates differ slightly because the bootstrap algorithm is inherently stochastic, but that is OK. The best we can hope for is an approximation, and these two approximations agree with one another quite well.\n", "\n", "An even better way of quantifying the uncertainty, especially when the distribution of the statistic is non-normal, is to produce a *confidence interval* on the statistic. Suppose we perform the experiment repeatedly and produce a \"95% confidence interval\" $(l_i, u_i)$ from the data in each experiment $i$; the \"95%\" means that we should expect the true value of the statistic (the difference in the *population* means) to be between $l_i$ and $u_i$ in 95% of the replications $i$." 
] }, { "cell_type": "code", "execution_count": 4, "id": "590cc9d0-0b34-4faf-829b-8302b4230553", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ConfidenceInterval(low=-18.494047619047613, high=82.84563492063491)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res.confidence_interval # 95% confidence interval by default" ] }, { "cell_type": "markdown", "id": "935b014a-ec43-4c5f-a480-c804edea3e6a", "metadata": {}, "source": [ "By choosing `method='percentile'` above, we indicated that bootstrap should estimate this confidence interval as the central 95% of the bootstrap distribution - that is, the boundaries of our interval will be the 2.5 and 97.5 percentiles of the bootstrap distribution." ] }, { "cell_type": "code", "execution_count": 5, "id": "86857327-0c35-4f5b-9529-9b1c65edd133", "metadata": {}, "outputs": [], "source": [ "ci_percentile = stats.scoreatpercentile(res.bootstrap_distribution, [2.5, 97.5])\n", "np.testing.assert_allclose(res.confidence_interval, ci_percentile) # confidence interval is the central 95% of the bootstrap distribution " ] }, { "cell_type": "markdown", "id": "86f64385-b4a6-4015-8222-3adc85572384", "metadata": {}, "source": [ "Again, this means that if we were to perform the mice experiment repeatedly and each time use `bootstrap` to compute such a confidence interval from the data, we would expect the confidence interval to contain the true value of the difference in mean survival times 95% of the time. Note also that our confidence interval contains 0. This is closely related to our conclusion from the hypothesis tests above: our data is not inconsistent with the null hypothesis that the treatment has no effect." 
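, "\n", "\n", "The percentile interval can also be sketched by hand, which may make its definition more concrete. The snippet below is only a minimal illustration, not a description of how `bootstrap` works internally: it resamples each group with replacement and takes the 2.5 and 97.5 percentiles of the resulting statistics. The helper name `mean_difference`, the seed, and `n_resamples = 2000` are assumptions made for this example.\n", "\n", "
```python
import numpy as np

# Mouse survival times in days, copied from the example above
x = np.array([94, 197, 16, 38, 99, 141, 23])  # treatment group
y = np.array([52, 104, 146, 10, 51, 30, 40, 27, 46])  # control group

def mean_difference(x, y):
    # hypothetical helper: difference between the two sample means
    return np.mean(x) - np.mean(y)

rng = np.random.default_rng(12345)  # seed chosen arbitrarily for repeatability
n_resamples = 2000  # assumed value for this sketch

# Resample each group independently, with replacement
boot = np.empty(n_resamples)
for i in range(n_resamples):
    xi = rng.choice(x, size=len(x), replace=True)
    yi = rng.choice(y, size=len(y), replace=True)
    boot[i] = mean_difference(xi, yi)

# The central 95% of the bootstrap distribution gives the percentile interval
low, high = np.percentile(boot, [2.5, 97.5])
observed = mean_difference(x, y)
print(f'observed difference: {observed:.2f}')
print(f'95% percentile interval: ({low:.2f}, {high:.2f})')
print(f'interval contains 0: {low < 0 < high}')
```
", "\n", "Because the resampling is random, the endpoints should only roughly match those reported by `bootstrap` above."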
] }, { "cell_type": "markdown", "id": "9e000136-9159-4361-9da8-b2d0f61d30ae", "metadata": {}, "source": [ "### Single-Sample, Scalar-Valued Statistics (and Confidence Intervals)\n", "This definition of a confidence interval can be difficult to interpret correctly, so we illustrate with a simpler example. Suppose there is an election with only two candidates, `0` and `1`, and all voters will vote for either one or the other (never both, and never neither). We wish to estimate the percentage of voters who will vote for candidate `1` by performing an experiment before the election: we will ask a random sample of 1000 voters who they will vote for on election day. The results are stored in the array `sample`." ] }, { "cell_type": "code", "execution_count": 6, "id": "2a72dd5a-7de6-4d7b-8d6d-e618f7c9f3fa", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "249 for candidate 0, 751 for candidate 1\n" ] } ], "source": [ "# Rather than entering `sample` directly, let's generate one to work with.\n", "# To simulate the results of such an experiment, suppose that the true \n", "# (but unknown) percentage of voters who will vote for candidate `1` is 75%. 
\n", "# If we sample voters at random from the population before the election and \n", "# ask them who they will vote for, the responses will follow a Bernoulli \n", "# distribution with shape parameter `p=0.75`.\n", "p = 0.75\n", "dist = stats.bernoulli(p=p)\n", "sample = dist.rvs(size=1000)\n", "vote_for_0 = np.sum(sample == 0)\n", "vote_for_1 = np.sum(sample == 1)\n", "print(f\"{vote_for_0} for candidate 0, {vote_for_1} for candidate 1\")" ] }, { "cell_type": "markdown", "id": "c9f9532d-46ee-4d59-ab3b-af452790fab1", "metadata": {}, "source": [ "The statistic we wish to estimate is the percentage of voters who will vote for candidate 1, so we can produce a *point estimate* of the statistic from the sample as:" ] }, { "cell_type": "code", "execution_count": 7, "id": "89b46532-a3b2-448f-b47a-22b92529074c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.751" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "def statistic(sample, axis=0):\n", "    return np.sum(sample, axis=axis) / sample.shape[axis]\n", "statistic(sample)" ] }, { "cell_type": "markdown", "id": "b30681e0-52d0-41ac-8e90-11ecee5bf235", "metadata": {}, "source": [ "`bootstrap` can produce a confidence interval around the point estimate." 
] }, { "cell_type": "code", "execution_count": 8, "id": "ee6d8613-22ca-4a59-bf66-c86ed1bdb726", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ConfidenceInterval(low=0.726, high=0.772)" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# As with `permutation_test`, the first argument of `bootstrap` needs to be a *sequence* of samples\n", "data = (sample,)\n", "# Passing `confidence_level=0.9` produces a 90% confidence interval\n", "res = stats.bootstrap(data, statistic, confidence_level=0.9)\n", "res.confidence_interval" ] }, { "cell_type": "markdown", "id": "593bb317-2f96-40a3-8f7f-1499015ea2b2", "metadata": {}, "source": [ "Suppose we perform the same experiment $100$ times, each time collecting new data from the same population and computing the confidence interval in the same way." ] }, { "cell_type": "code", "execution_count": 9, "id": "b09ee5a2-1adb-4430-8118-ad16f2ed2549", "metadata": {}, "outputs": [], "source": [ "# lower and upper limits of confidence intervals produced by `bootstrap`\n", "n_replications = 100 # 100 replications of the same experiment\n", "n_observations = 1000 # 1000 observations per sample\n", "\n", "# Draw 100 new samples from the same population of voters, each with 1000 observations\n", "sample = dist.rvs(size=(n_replications, n_observations))\n", "\n", "# bootstrap the 90% confidence interval for all 100 samples\n", "# (`batch=10` processes 10 resamples per vectorized call to `statistic`)\n", "res = stats.bootstrap((sample,), statistic, confidence_level=0.9, axis=1, batch=10)\n", "li, ui = res.confidence_interval\n", "\n", "# This is equivalent to (but faster than) the following\n", "# li = np.empty((n_replications,))\n", "# ui = np.empty((n_replications,))\n", "# for i in range(n_replications):\n", "#     sample = dist.rvs(size=n_observations) # collect a new sample from the same population of voters\n", "#     res = stats.bootstrap((sample,), statistic, confidence_level=0.9, vectorized=False)\n", "#     li[i], ui[i] = res.confidence_interval" ] }, { "cell_type": 
"markdown", "id": "55cbf673-1d90-4436-9755-12c79cd4c054", "metadata": {}, "source": [ "We expect that the confidence interval will contain the true value of the statistic ($p=0.75$) approximately 90% of the time." ] }, { "cell_type": "code", "execution_count": 10, "id": "9856f8c6-4f4e-43e6-98e7-476202f27bcb", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "87\n" ] } ], "source": [ "contained = (li < p) & (p < ui)\n", "print(np.sum(contained))" ] }, { "cell_type": "markdown", "id": "c172f0e3-40cf-49ec-841e-0a3a7ab7bf64", "metadata": {}, "source": [ "### Paired-Sample, Vector-Valued Statistics\n", "\n", "[An Introduction to the Bootstrap](https://books.google.com/books?id=MWC1DwAAQBAJ&printsec=frontcover) considers a small data set collected when studying a medical device for continuously delivering an anti-inflammatory hormone to test subjects. The arrays `x` and `y` record the number of hours the device was worn and the amount of hormone remaining in the device, respectively." ] }, { "cell_type": "code", "execution_count": 11, "id": "07480ef9-594e-4eb6-9661-25cced033299", "metadata": {}, "outputs": [], "source": [ "x = np.array([99, 152, 293, 155, 196, 53, 184, 171, 52, 376, 385, 402, 29, 76, 296, 151, 177, 209, 119, 188, 115, 88, 58, 49, 150, 107, 125]) # hours worn\n", "y = np.array([25.8, 20.5, 14.3, 23.2, 20.6, 31.1, 20.9, 20.9, 30.4, 16.3, 11.6, 11.8, 32.5, 32.0, 18.0, 24.1, 26.5, 25.8, 28.8, 22.0, 29.7, 28.9, 32.8, 32.5, 25.4, 31.7, 28.5]) # amount remaining (units unspecified)" ] }, { "cell_type": "markdown", "id": "e1e416f8-7f0e-46f8-9584-388fd67602fe", "metadata": {}, "source": [ "A standard linear regression is performed in SciPy as follows." ] }, { "cell_type": "code", "execution_count": 12, "id": "f64d81c2-d847-425a-bee0-dc53e8efed05", "metadata": {}, "outputs": [ { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" }, { "name": "stdout", "output_type": "stream", "text": [ "The slope estimate is: -0.0574462986976377\n", "The intercept estimate is: 34.16752817399911\n", "The slope standard error is: 0.004464173160311544\n", "The intercept standard error is: 0.8671972620941928\n" ] } ], "source": [ "res_lr = stats.linregress(x, y)\n", "\n", "plt.plot(x, y, '.', label='original data')\n", "plt.plot(x, res_lr.intercept + res_lr.slope*x, 'r', label='fitted line')\n", "plt.legend()\n", "plt.show()\n", "print(f\"The slope estimate is: {res_lr.slope}\")\n", "print(f\"The intercept estimate is: {res_lr.intercept}\")\n", "print(f\"The slope standard error is: {res_lr.stderr}\")\n", "print(f\"The intercept standard error is: {res_lr.intercept_stderr}\")" ] }, { "cell_type": "markdown", "id": "3aa96aa6-be6a-4e9c-a55c-f5e4e14fb6c4", "metadata": {}, "source": [ "`linregress` produces point estimates of the slope and intercept as well as standard errors for each statistic, assuming that the residuals between the best fit line and the data are normally distributed. We can test the normality assumption using `stats.shapiro`." ] }, { "cell_type": "code", "execution_count": 13, "id": "b5eb0671-a179-47e7-809c-8bc39f56843d", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ShapiroResult(statistic=0.9171469211578369, pvalue=0.03371993452310562)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# residuals: observed values minus fitted values\n", "e = y - (res_lr.intercept + res_lr.slope*x)\n", "stats.shapiro(e)" ] }, { "cell_type": "markdown", "id": "77de4703-5400-4a4b-a928-0af8878c330a", "metadata": {}, "source": [ "Although the $p$-value is not small enough to conclusively reject the null hypothesis at all reasonable significance levels, it does suggest that we might want to relax the residual normality assumption. 
`bootstrap` makes no such assumption about the residuals, and it can go beyond the standard errors, producing bias-corrected confidence intervals. The standard errors produced by `bootstrap` match those produced by `linregress` fairly well; however, `linregress` may overestimate these quantities for this data." ] }, { "cell_type": "code", "execution_count": 14, "id": "b393954f-53a8-4e69-baf2-6f75f79020f3", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "The slope standard error is: 0.004260082018022721\n", "The intercept standard error is: 0.7292901429594613\n", "The confidence interval on the slope is: (-0.06516915578638195, -0.04868503288913481)\n", "The confidence interval on the intercept is: (32.58844595440181, 35.45478220570269)\n" ] } ], "source": [ "def statistic(x, y):\n", "    res = stats.linregress(x, y)\n", "    return res.slope, res.intercept\n", "\n", "res = stats.bootstrap((x, y), statistic, vectorized=False, paired=True)\n", "\n", "print(f\"The slope standard error is: {res.standard_error[0]}\")\n", "print(f\"The intercept standard error is: {res.standard_error[1]}\")\n", "print(f\"The confidence interval on the slope is: {res.confidence_interval.low[0], res.confidence_interval.high[0]}\")\n", "print(f\"The confidence interval on the intercept is: {res.confidence_interval.low[1], res.confidence_interval.high[1]}\")" ] }, { "cell_type": "markdown", "id": "1dceaaf7-ba8e-44d7-afb7-1cc1e74ed491", "metadata": {}, "source": [ "Because the statistic has multiple values, a visualization of the bootstrap distribution may be more informative." ] }, { "cell_type": "code", "execution_count": 15, "id": "662eee2a-0c36-4f9e-bbaa-8896f328aa8c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "[]" ] }, "execution_count": 15, "metadata": {}, "output_type": "execute_result" }, { "data": { "image/png": "\n", "text/plain": [ "
" ] }, "metadata": { "needs_background": "light" }, "output_type": "display_data" } ], "source": [ "for m, b in res.bootstrap_distribution.T[::10]:\n", "    plt.plot(x, m*x + b, color='b', alpha=0.01)\n", "plt.plot(x, y, '.', label='original data')\n", "plt.plot(x, res_lr.intercept + res_lr.slope*x, 'r', label='fitted line')" ] }, { "cell_type": "markdown", "id": "a38780c7-38c8-44fe-93be-7e5347e3aea4", "metadata": {}, "source": [ "A major advantage of the bootstrap is that it can produce standard errors and confidence intervals even in more general regression models that have no simple analytical solutions, such as when the regression function is nonlinear in the parameters and when using fitting methods other than least squares." ] }, { "cell_type": "markdown", "id": "1a886e39-f2b5-4ea4-8dff-110b7ddb5b45", "metadata": {}, "source": [ "### Gotchas\n", "\n", "Our final example shows yet another application of `bootstrap`, chosen to illustrate common pitfalls.\n", "\n", "[An Introduction to the Bootstrap](https://books.google.com/books?id=MWC1DwAAQBAJ&printsec=frontcover) presents a study about whether regular doses of aspirin can prevent heart attacks. Subjects were randomly assigned to two groups: 11,037 received aspirin pills, and the remaining 11,034 received placebos. The subjects were instructed to take one pill every other day, and the scientists recorded the number of subjects who experienced a heart attack during the study period: 104 in the aspirin group, and 189 in the placebo group. The statistic to assess the effectiveness of aspirin was the relative prevalence of heart attacks in the aspirin group versus the placebo group." 
] }, { "cell_type": "code", "execution_count": 16, "id": "f4153813-8a78-43c6-bd28-2cc6f89d7a33", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "0.5501149812103875" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "x = np.zeros(11037) # 11037 subjects in the aspirin group\n", "x[:104] = 1 # 104 experience a heart attack\n", "y = np.zeros(11034) # 11034 subjects in the placebo group\n", "y[:189] = 1 # 189 experience a heart attack\n", "def statistic(x, y):\n", "    return (np.sum(x)/len(x))/(np.sum(y)/len(y))\n", "statistic(x, y)" ] }, { "cell_type": "markdown", "id": "d1b65fcb-6f4b-480c-a163-085e272e3249", "metadata": {}, "source": [ "The risk of heart attack for aspirin-takers seemed to be approximately half that of placebo-takers.\n", "\n", "Suppose we wish to generate a 95% confidence interval to quantify our uncertainty." ] }, { "cell_type": "code", "execution_count": 17, "id": "2b1bcd63-2a11-4643-b036-bce7121afb4f", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "TypeError: bootstrap() takes 2 positional arguments but 3 positional arguments (and 1 keyword-only argument) were given\n" ] } ], "source": [ "try:\n", "    stats.bootstrap(x, y, statistic, confidence_level=0.95)\n", "except Exception as e:\n", "    print(f\"{type(e).__name__}: {e}\")" ] }, { "cell_type": "markdown", "id": "f0b5caf3-3434-41a2-b8ea-da8d1ccfe679", "metadata": {}, "source": [ "This reminds us that all the data needs to be passed in as a single sequence, not two separate arguments `x` and `y`." 
] }, { "cell_type": "code", "execution_count": 18, "id": "8d2738da-e79f-4ed3-968d-6533dadb7769", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "ValueError: `method = 'BCa' is only available for one-sample statistics\n" ] } ], "source": [ "data = (x, y)\n", "try:\n", "    stats.bootstrap(data, statistic, confidence_level=0.95)\n", "except Exception as e:\n", "    print(f\"{type(e).__name__}: {e}\")" ] }, { "cell_type": "markdown", "id": "aebd581b-d7df-4179-a6f1-a8d4feab9043", "metadata": {}, "source": [ "`bootstrap` offers a `method` argument that selects how the confidence interval is estimated from the bootstrap distribution; the three methods `{'BCa', 'percentile', 'basic'}` differ in computational cost and accuracy. `BCa` is the most computationally intensive but tends to be the most accurate, so it is the default. However, it is currently available only for one-sample statistics, whereas our data consists of two independent samples, `x` and `y`. Let's try another option, `percentile`." ] }, { "cell_type": "code", "execution_count": 19, "id": "33a84978-c534-4069-8f7e-3171614e0e20", "metadata": {}, "outputs": [], "source": [ "try:\n", "    stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95)\n", "except Exception as e:\n", "    print(f\"{type(e).__name__}: {e}\")" ] }, { "cell_type": "markdown", "id": "47a2f909-b392-4738-ade4-4a80e07f7071", "metadata": {}, "source": [ "Unlike `permutation_test`, `bootstrap` expects `statistic` to be vectorized by default. If the call above raises an error for this reason, we can resolve it by passing `vectorized=False`."
] }, { "cell_type": "code", "execution_count": 20, "id": "15046441-1e6a-45c8-bdf6-2efa274f127c", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ConfidenceInterval(low=0.4324405646124769, high=0.6926221426325354)" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95, vectorized=False)\n", "res.confidence_interval" ] }, { "cell_type": "markdown", "id": "09cc4689-4037-4ddb-a457-449015ed5256", "metadata": {}, "source": [ "Alternatively, we can vectorize our statistic by making it accept an `axis` parameter and compute along the specified axis of the N-dimensional arrays `x` and `y`." ] }, { "cell_type": "code", "execution_count": 21, "id": "7588b203-8de2-4b52-9ced-64ad374c82a7", "metadata": {}, "outputs": [], "source": [ "def statistic(x, y, axis=0):\n", "    return (np.sum(x, axis=axis)/x.shape[axis])/(np.sum(y, axis=axis)/y.shape[axis])\n", "\n", "try:\n", "    res = stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95)\n", "    res.confidence_interval\n", "except Exception as e:\n", "    print(f\"{type(e).__name__}: {e}\")" ] }, { "cell_type": "markdown", "id": "7800e793-92cb-4871-815b-06e3da89bb18", "metadata": {}, "source": [ "Depending on your computer's hardware, you may run into a `MemoryError` here. Vectorized computations can require a lot of memory. The default value of `n_resamples` is $9,999$, and there are a total of $11,037 + 11,034 = 22,071$ observations. Therefore, the resampled data arrays will contain a total of $9,999 \\cdot 22,071 = 220,687,929$ elements. Each element is stored in double precision (8 bytes), so at least 1.7 GB will be used during the calculation. To relax the memory requirement, we'll process the data in batches of 100 resamples rather than all at once."
] }, { "cell_type": "code", "execution_count": 22, "id": "f6769306-af22-42b3-850b-377e40e3b703", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "ConfidenceInterval(low=0.4309040629913994, high=0.6944371938603752)" ] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [ "res = stats.bootstrap(data, statistic, method='percentile', confidence_level=0.95, batch=100)\n", "res.confidence_interval" ] }, { "cell_type": "markdown", "id": "19a3d153-fb2f-4f32-aa10-a347799c33e9", "metadata": { "tags": [] }, "source": [ "## Conclusion\n", "\n", "The resampling approaches in SciPy can be used not only to replicate the results of most of SciPy's hypothesis tests, but also to\n", "\n", "- improve the accuracy of statistical tests for small sample sizes and in the presence of ties,\n", "- provide standard errors and confidence intervals for arbitrary statistics, and\n", "- easily implement statistical tests that SciPy does not yet offer." ] } ], "metadata": { "kernelspec": { "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.5" } }, "nbformat": 4, "nbformat_minor": 5 }