Last night I had the pleasure of presenting a talk for the PyData Boston Meetup. I presented a project I started earlier this summer, using data from the General Social Survey to measure and predict trends in religious affiliation and belief in the U.S.

The slides, which include the results so far and an overview of the methodology, are here:

And the code and data are all in this Jupyter notebook. I'll post additional results and discussion over the next few weeks.

Thanks to Milos Miljkovic, organizer of the PyData Boston Meetup, for inviting me, and to O'Reilly Media for hosting the meeting.

# Probably Overthinking It

A blog by Allen Downey.

## Wednesday, June 14, 2017

## Thursday, June 1, 2017

### Spring 2017 Data Science reports

In my Data Science class this semester, students worked on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings. After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.

###

One topic that enters popular discussion every four years is "who votes?" Every presidential election we see many discussions on which groups are more likely to vote, and which important voter groups each candidate needs to capture. One theme that is often part of this discussion is whether or not a candidate's biggest support is among groups likely to turn out. This analysis of the General Social Survey uses a number of different demographic variables to try and answer that question. Report

###

Using a dataset published by Medium on Kaggle, I explored the relationship between an employee's working conditions and the likelihood that they will quit their job. There were some expected trends, like lower salary leading to a higher attrition rate, but also some surprising ones, like having an accident at work leading to a lower likelihood of quitting! This observed information can be used by employers to determine the quitting probability of a specific individual, or to calculate the attrition rate of a larger group, like a department, and adjust their conditions accordingly.

###

Politics has often been a polarizing subject amongst Americans, and in today's increasingly partisan political environment, that has not changed. Using data from the General Social Survey (GSS), an annual study designed and conducted by the National Opinion Research Center (NORC) at the University of Chicago, we identify variables that are correlated with a person's political views. We find that while marital status has a statistically significant apparent effect on political views, that apparent effect is drastically reduced when including confounding variables, particularly religion. Report

###

In the 1990s, the USDA put out the image of a Food Guide Pyramid to help direct dietary choices. It grouped foods into six categories: grains, proteins (meats, fish, eggs, etc), vegetables, fruits, dairy, and fats and oils. Since then, the pyramid has been revamped in 2005, and then pushed towards a plate with five categories (oils were dropped) in the 2010s. The general population has learned of these basic food groups since grade school, and over time either fully adopts them into their lifestyles, or abandons them to pursue their own balanced diet. In light of the controversy surrounding the Food Pyramid, we decided to ask whether the food categories found in the Food Pyramid truly represent the correct groupings for food, and if not, just how far off are they? Using K-Means clustering on an extensive food databank, we created 6 groupings of food based on their macronutrient composition, which was the primary criteria the original Food Pyramid used in its categorization. We found that the K-Means groups only overlapped with existing food groups from the Food Pyramid 50% of the time, potentially suggesting that the idea of the basic food groups could be outdated. Report

###

It is well known fact that excessive amount of default from subprime mortgages, which are mortgages normally issued to a borrower of low credit, was a leading cause of subprime mortgage crisis that led to a global financial meltdown in 2007. Because of this nightmarish experience, it seems plausible to assume that current home mortgages are much harder to get and much more conservative (in terms of risks the lender is taking, shown mainly as an interest rate) than pre-2007 mortgages. Using a dataset containing all home mortgages purchased or guaranteed from The Federal Home Loan Mortgage Corporation, more commonly known as Freddie Mac, I investigate whether there is any noticeable difference between the interest rates before and after subprime mortgage crisis.

Report

###

Players in the NBA are often compared to others, both active and retired, based on similar play styles. For example, it is common to hear statements such as “Russell Westbrook is the new Derrick Rose”. The purpose of our project is to apply machine learning in the form of clustering to see which players are actually similar based on 22 variables. We successfully generated clusters of players that are very similar quantitatively. It is up to the reader to decide whether this is qualitatively true. Report

###

We can tell where a food is from - at least, culturally - from just a few bites. There are palettes of ingredients and spices which are strongly associated with each other - giving cajun cooking its kick, and french cuisine its "je ne sais quoi." But, what exactly these palettes and pairings are varies - ask ten different chefs, and you'll get six different answers. We look for a statistical way to identify "trinities" like "onion, carrot, celery" or "garlic, sesame oil, soy sauce," in the process both finding several associations not typically reflected in culinary literature and creating a tool which extends recipes based on their already-known ingredients, in a manner akin to a food version of a cell phone's autocomplete. Report

###

I examined the Pew News Coverage Index dataset from the years 2010 and 2012 to see how the different topics and stories were covered across media sectors and sources. The combined dataset had over 70,000 stories from all media sectors: print, online, cable tv, network tv, and broadcast radio. From the data, topics have less variance in word count and duration than sources. Report

### How Do You Predict Who Will Vote?

Sean CarterOne topic that enters popular discussion every four years is "who votes?" Every presidential election we see many discussions on which groups are more likely to vote, and which important voter groups each candidate needs to capture. One theme that is often part of this discussion is whether or not a candidate's biggest support is among groups likely to turn out. This analysis of the General Social Survey uses a number of different demographic variables to try and answer that question. Report

### Designing the Optimal Employee Experience... For Employers

Joey MaaloufUsing a dataset published by Medium on Kaggle, I explored the relationship between an employee's working conditions and the likelihood that they will quit their job. There were some expected trends, like lower salary leading to a higher attrition rate, but also some surprising ones, like having an accident at work leading to a lower likelihood of quitting! This observed information can be used by employers to determine the quitting probability of a specific individual, or to calculate the attrition rate of a larger group, like a department, and adjust their conditions accordingly.

^{Report}### Does being married have an effect on your political views?

Apurva Raman and William LuPolitics has often been a polarizing subject amongst Americans, and in today's increasingly partisan political environment, that has not changed. Using data from the General Social Survey (GSS), an annual study designed and conducted by the National Opinion Research Center (NORC) at the University of Chicago, we identify variables that are correlated with a person's political views. We find that while marital status has a statistically significant apparent effect on political views, that apparent effect is drastically reduced when including confounding variables, particularly religion. Report

### Should you Follow the Food Groups for Dietary Advice?

Kaitlyn Keil and Kevin ZhangIn the 1990s, the USDA put out the image of a Food Guide Pyramid to help direct dietary choices. It grouped foods into six categories: grains, proteins (meats, fish, eggs, etc), vegetables, fruits, dairy, and fats and oils. Since then, the pyramid has been revamped in 2005, and then pushed towards a plate with five categories (oils were dropped) in the 2010s. The general population has learned of these basic food groups since grade school, and over time either fully adopts them into their lifestyles, or abandons them to pursue their own balanced diet. In light of the controversy surrounding the Food Pyramid, we decided to ask whether the food categories found in the Food Pyramid truly represent the correct groupings for food, and if not, just how far off are they? Using K-Means clustering on an extensive food databank, we created 6 groupings of food based on their macronutrient composition, which was the primary criteria the original Food Pyramid used in its categorization. We found that the K-Means groups only overlapped with existing food groups from the Food Pyramid 50% of the time, potentially suggesting that the idea of the basic food groups could be outdated. Report

### Are Terms of Home Mortgage Less Favorable Now Compared to Pre Mortgage Crisis?

Sungwoo ParkIt is well known fact that excessive amount of default from subprime mortgages, which are mortgages normally issued to a borrower of low credit, was a leading cause of subprime mortgage crisis that led to a global financial meltdown in 2007. Because of this nightmarish experience, it seems plausible to assume that current home mortgages are much harder to get and much more conservative (in terms of risks the lender is taking, shown mainly as an interest rate) than pre-2007 mortgages. Using a dataset containing all home mortgages purchased or guaranteed from The Federal Home Loan Mortgage Corporation, more commonly known as Freddie Mac, I investigate whether there is any noticeable difference between the interest rates before and after subprime mortgage crisis.

Report

### Finding NBA Players with Similar Styles

Willem Thorbecke and David PappPlayers in the NBA are often compared to others, both active and retired, based on similar play styles. For example, it is common to hear statements such as “Russell Westbrook is the new Derrick Rose”. The purpose of our project is to apply machine learning in the form of clustering to see which players are actually similar based on 22 variables. We successfully generated clusters of players that are very similar quantitatively. It is up to the reader to decide whether this is qualitatively true. Report

### Food Trinities and Recipe Completion

Matt RuehleWe can tell where a food is from - at least, culturally - from just a few bites. There are palettes of ingredients and spices which are strongly associated with each other - giving cajun cooking its kick, and french cuisine its "je ne sais quoi." But, what exactly these palettes and pairings are varies - ask ten different chefs, and you'll get six different answers. We look for a statistical way to identify "trinities" like "onion, carrot, celery" or "garlic, sesame oil, soy sauce," in the process both finding several associations not typically reflected in culinary literature and creating a tool which extends recipes based on their already-known ingredients, in a manner akin to a food version of a cell phone's autocomplete. Report

### All the News in 2010 and 2012

Radmer van der HeydeI examined the Pew News Coverage Index dataset from the years 2010 and 2012 to see how the different topics and stories were covered across media sectors and sources. The combined dataset had over 70,000 stories from all media sectors: print, online, cable tv, network tv, and broadcast radio. From the data, topics have less variance in word count and duration than sources. Report

## Wednesday, April 26, 2017

### Python as a way of thinking

This article contains supporting material for this blog post at

I presented a longer version of this argument in a talk I presented at Olin College last fall. The slides are here:

Here are Jupyter notebooks with the code examples I mentioned in the talk:

A Counter is a natural representation of a multiset, which is a set where the elements can appear more than once. You can extend Counter with set operations like is_subset:

Suite is an abstract parent class; child classes should provide a likelihood method that evaluates the likelihood of the data under a given hypothesis.

*Scientific American*. The thesis of the post is that modern programming languages (like Python) are qualitatively different from the first generation (like FORTRAN and C), in ways that make them effective tools for teaching, learning, exploring, and thinking.I presented a longer version of this argument in a talk I presented at Olin College last fall. The slides are here:

Here are Jupyter notebooks with the code examples I mentioned in the talk:

- Breadth-first search in Python
- Using Counters, including the Bayesian update example.
- Introduction to PMFs, including the anagram example.
- Vectors, Frames, and Transforms.
- Cacophony for the Whole Family, an example from
*Think DSP*.

Here's my presentation at SciPy 2015, where I talked more about Python as a way of teaching and learning DSP:

Finally, here's the notebook "Using Counters", which uses Python's Counter object to implement a PMF (probability mass function) and perform Bayesian updates.

In [13]:

```
from __future__ import print_function, division
from collections import Counter
import numpy as np
```

A counter is a map from values to their frequencies. If you initialize a counter with a string, you get a map from each letter to the number of times it appears. If two words are anagrams, they yield equal Counters, so you can use Counters to test anagrams in linear time.

In [3]:

```
def is_anagram(word1, word2):
"""Checks whether the words are anagrams.
word1: string
word2: string
returns: boolean
"""
return Counter(word1) == Counter(word2)
```

In [4]:

```
is_anagram('tachymetric', 'mccarthyite')
```

Out[4]:

True

In [5]:

```
is_anagram('banana', 'peach')
```

Out[5]:

False

**Multisets**

A Counter is a natural representation of a multiset, which is a set where the elements can appear more than once. You can extend Counter with set operations like is_subset:

In [6]:

```
class Multiset(Counter):
"""A multiset is a set where elements can appear more than once."""
def is_subset(self, other):
"""Checks whether self is a subset of other.
other: Multiset
returns: boolean
"""
for char, count in self.items():
if other[char] < count:
return False
return True
# map the <= operator to is_subset
__le__ = is_subset
```

You could use

`is_subset`in a game like Scrabble to see if a given set of tiles can be used to spell a given word.
In [7]:

```
def can_spell(word, tiles):
"""Checks whether a set of tiles can spell a word.
word: string
tiles: string
returns: boolean
"""
return Multiset(word) <= Multiset(tiles)
```

In [8]:

```
can_spell('SYZYGY', 'AGSYYYZ')
```

Out[8]:

True

## Probability Mass Functions¶

You can also extend Counter to represent a probability mass function (PMF).`normalize`

computes the total of the frequencies and divides through, yielding probabilities that add to 1.`__add__`

enumerates all pairs of value and returns a new Pmf that represents the distribution of the sum.`__hash__`

and `__id__`

make Pmfs hashable; this is not the best way to do it, because they are mutable. So this implementation comes with a warning that if you use a Pmf as a key, you should not modify it. A better alternative would be to define a frozen Pmf.`render`

returns the values and probabilities in a form ready for plotting
In [9]:

```
class Pmf(Counter):
"""A Counter with probabilities."""
def normalize(self):
"""Normalizes the PMF so the probabilities add to 1."""
total = float(sum(self.values()))
for key in self:
self[key] /= total
def __add__(self, other):
"""Adds two distributions.
The result is the distribution of sums of values from the
two distributions.
other: Pmf
returns: new Pmf
"""
pmf = Pmf()
for key1, prob1 in self.items():
for key2, prob2 in other.items():
pmf[key1 + key2] += prob1 * prob2
return pmf
def __hash__(self):
"""Returns an integer hash value."""
return id(self)
def __eq__(self, other):
return self is other
def render(self):
"""Returns values and their probabilities, suitable for plotting."""
return zip(*sorted(self.items()))
```

As an example, we can make a Pmf object that represents a 6-sided die.

In [10]:

```
d6 = Pmf([1,2,3,4,5,6])
d6.normalize()
d6.name = 'one die'
print(d6)
```

Using the add operator, we can compute the distribution for the sum of two dice.

In [11]:

```
d6_twice = d6 + d6
d6_twice.name = 'two dice'
for key, prob in d6_twice.items():
print(key, prob)
```

Using numpy.sum, we can compute the distribution for the sum of three dice.

In [14]:

```
# if we use the built-in sum we have to provide a Pmf additive identity value
# pmf_ident = Pmf([0])
# d6_thrice = sum([d6]*3, pmf_ident)
# with np.sum, we don't need an identity
d6_thrice = np.sum([d6, d6, d6])
d6_thrice.name = 'three dice'
```

And then plot the results (using Pmf.render)

In [19]:

```
import matplotlib.pyplot as plt
%matplotlib inline
```

In [20]:

```
for die in [d6, d6_twice, d6_thrice]:
xs, ys = die.render()
plt.plot(xs, ys, label=die.name, linewidth=3, alpha=0.5)
plt.xlabel('Total')
plt.ylabel('Probability')
plt.legend()
plt.show()
```

## Bayesian statistics¶

A Suite is a Pmf that represents a set of hypotheses and their probabilities; it provides`bayesian_update`

, which updates the probability of the hypotheses based on new data.Suite is an abstract parent class; child classes should provide a likelihood method that evaluates the likelihood of the data under a given hypothesis.

`update_bayesian`

loops through the hypothesis, evaluates the likelihood of the data under each hypothesis, and updates the probabilities accordingly. Then it re-normalizes the PMF.
In [21]:

```
class Suite(Pmf):
"""Map from hypothesis to probability."""
def bayesian_update(self, data):
"""Performs a Bayesian update.
Note: called bayesian_update to avoid overriding dict.update
data: result of a die roll
"""
for hypo in self:
like = self.likelihood(data, hypo)
self[hypo] *= like
self.normalize()
```

As an example, I'll use Suite to solve the "Dice Problem," from Chapter 3 of

"Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?"

I'll start by making a list of Pmfs to represent the dice:

*Think Bayes*."Suppose I have a box of dice that contains a 4-sided die, a 6-sided die, an 8-sided die, a 12-sided die, and a 20-sided die. If you have ever played Dungeons & Dragons, you know what I am talking about. Suppose I select a die from the box at random, roll it, and get a 6. What is the probability that I rolled each die?"

I'll start by making a list of Pmfs to represent the dice:

In [31]:

```
def make_die(num_sides):
die = Pmf(range(1, num_sides+1))
die.name = 'd' + str(num_sides)
die.normalize()
return die
dice = [make_die(x) for x in [4, 6, 8, 12, 20]]
for die in dice:
print(die)
```

Next I'll define DiceSuite, which inherits

`bayesian_update`

from Suite and provides `likelihood`

.`data`

is the observed die roll, 6 in the example.`hypo`

is the hypothetical die I might have rolled; to get the likelihood of the data, I select, from the given die, the probability of the given value.
In [26]:

```
class DiceSuite(Suite):
def likelihood(self, data, hypo):
"""Computes the likelihood of the data under the hypothesis.
data: result of a die roll
hypo: Pmf object representing a die
"""
return hypo[data]
```

Finally, I use the list of dice to instantiate a Suite that maps from each die to its prior probability. By default, all dice have the same prior.

Then I update the distribution with the given value and print the results:

Then I update the distribution with the given value and print the results:

In [33]:

```
dice_suite = DiceSuite(dice)
dice_suite.bayesian_update(6)
for die in sorted(dice_suite):
print(len(die), dice_suite[die])
```

4 0.0 6 0.392156862745 8 0.294117647059 12 0.196078431373 20 0.117647058824

As expected, the 4-sided die has been eliminated; it now has 0 probability. The 6-sided die is the most likely, but the 8-sided die is still quite possible.

Now suppose I roll the die again and get an 8. We can update the Suite again with the new data

Now suppose I roll the die again and get an 8. We can update the Suite again with the new data

In [30]:

```
dice_suite.bayesian_update(8)
for die, prob in sorted(dice_suite.items()):
print(die.name, prob)
```

d4 0.0 d6 0.0 d8 0.623268698061 d12 0.277008310249 d20 0.0997229916898

Now the 6-sided die has been eliminated, the 8-sided die is most likely, and there is less than a 10% chance that I am rolling a 20-sided die.

These examples demonstrate the versatility of the Counter class, one of Python's underused data structures.

These examples demonstrate the versatility of the Counter class, one of Python's underused data structures.

In [ ]:

```
```

## Tuesday, April 4, 2017

### Honey, money, weather, terror

In my Data Science class this semester, students are working on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings. After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.

Using the National Health Interview Survey data, I was able to look for a potential link between military status and financial status. More specifically, I wanted to check if whether someone served in the United States military affects their current income bracket. It was apparent that people who served in the military were underrepresented in low income brackets and overrepresented in high income brackets compared to the rest of the population. This difference appears more clearly if we group the income data into even broader brackets for further analysis; being in the military increased one's chances of being in the upper half of respondents by 16.06 percentage points, and of being in the upper third by 14.64 percentage points. Further statistical analysis reported a Cohen effect size of 0.32, which is above the standard threshold to be considered more than a small effect.

Report

In the so-called "War on Drugs", one of the primary tactics is teaching children to "Just Say No!" However, less attention is paid to treatment for those who are already addicted. Except for the occasional comment on how a celebrity disappeared off to rehab, there is a silence in our culture about the apparently shameful act of getting treatment. This silence made me begin to wonder: how many people who struggle with addictions actually get treated, and how long does it take before they receive this help? Using the National Survey on Drug Use and Health data from 2014, I found that very few people who use drugs report getting treatment or counseling, and the length of time they go without getting treatment isn't particularly correlated with other factors.

Report

With Trump's recent travel ban and the escalation of controversial actions against Middle Eastern people, there has been a rise of paranoia towards the Middle East region for fear of the possibility of a terrorist attack. But is there is a reason to be so afraid of the Middle East, or even terrorism in general? What is the chance that an American would be a victim to terrorism? This article looks into just how likely the average person in the US will be affected by a terrorist attack, should one happen. Results show that the chance of a person being affected by terrorism in the North American region is almost 0, especially when compared to the probability in the Middle East itself. The data suggests that people's fears are unfounded and that the controversial reactions towards Middle East citizens because of a 1 in 15 million chance are irrational.

Report

Using a dataset containing over 4 million Facebook posts from 15 mainstream news outlets, I investigate the existence of seasonality in the number of likes a Facebook post from a news outlet gets. The dataset contains contents and attributes, such as number of likes and timestamp, of all facebook posts posted by the top media sources from 2012 to 2016. The media outlets included in the data are ABC, BBC, CBS, CNN, Fox & Friends, Fox, LA Times, NBC, NPR, The Huffington Post, The New York Times, The Wall Street Journal, The Washington Post, Time, and USA Today.

Report

The goal of this article is to explore the association between drug usage and depression. Intuitively, many would argue that those who use drugs are more likely to be depressed. To explore this relationship, we took data from the National Drug Usage and Health Survey from 2014. We conducted logistical regression on cocaine, marijuana, alcohol, and heroin while controlling for possible confounding variables such as sex, income, and health conditions. Surprisingly, there appears to be a negative correlation between drug usage and depression.

Report

We examine the historic honey-producing bee colony counts, yield, and honey production of states as collected by the USDA, finding statistical evidence for regional "clustering" of production and a negative correlation between per-hive yield and overall price, most strongly reflected in states with the greatest absolute production

Report

### The Impact of Military Status on Income Bracket

Joey MaaloufUsing the National Health Interview Survey data, I was able to look for a potential link between military status and financial status. More specifically, I wanted to check if whether someone served in the United States military affects their current income bracket. It was apparent that people who served in the military were underrepresented in low income brackets and overrepresented in high income brackets compared to the rest of the population. This difference appears more clearly if we group the income data into even broader brackets for further analysis; being in the military increased one's chances of being in the upper half of respondents by 16.06 percentage points, and of being in the upper third by 14.64 percentage points. Further statistical analysis reported a Cohen effect size of 0.32, which is above the standard threshold to be considered more than a small effect.

Report

### Getting Treatment

Kaitlyn KeilIn the so-called "War on Drugs", one of the primary tactics is teaching children to "Just Say No!" However, less attention is paid to treatment for those who are already addicted. Except for the occasional comment on how a celebrity disappeared off to rehab, there is a silence in our culture about the apparently shameful act of getting treatment. This silence made me begin to wonder: how many people who struggle with addictions actually get treated, and how long does it take before they receive this help? Using the National Survey on Drug Use and Health data from 2014, I found that very few people who use drugs report getting treatment or counseling, and the length of time they go without getting treatment isn't particularly correlated with other factors.

Report

### What's the Chance You will Die Due to Terrorism?

Kevin ZhangWith Trump's recent travel ban and the escalation of controversial actions against Middle Eastern people, there has been a rise of paranoia towards the Middle East region for fear of the possibility of a terrorist attack. But is there is a reason to be so afraid of the Middle East, or even terrorism in general? What is the chance that an American would be a victim to terrorism? This article looks into just how likely the average person in the US will be affected by a terrorist attack, should one happen. Results show that the chance of a person being affected by terrorism in the North American region is almost 0, especially when compared to the probability in the Middle East itself. The data suggests that people's fears are unfounded and that the controversial reactions towards Middle East citizens because of a 1 in 15 million chance are irrational.

Report

### Is There a Seasonality in the Response to Social Media Posts?

Sungwoo ParkUsing a dataset containing over 4 million Facebook posts from 15 mainstream news outlets, I investigate the existence of seasonality in the number of likes a Facebook post from a news outlet gets. The dataset contains contents and attributes, such as number of likes and timestamp, of all facebook posts posted by the top media sources from 2012 to 2016. The media outlets included in the data are ABC, BBC, CBS, CNN, Fox & Friends, Fox, LA Times, NBC, NPR, The Huffington Post, The New York Times, The Wall Street Journal, The Washington Post, Time, and USA Today.

Report

### The Association Between Drug Usage and Depression

David Papp and Willem ThorbeckeThe goal of this article is to explore the association between drug usage and depression. Intuitively, many would argue that those who use drugs are more likely to be depressed. To explore this relationship, we took data from the National Drug Usage and Health Survey from 2014. We conducted logistical regression on cocaine, marijuana, alcohol, and heroin while controlling for possible confounding variables such as sex, income, and health conditions. Surprisingly, there appears to be a negative correlation between drug usage and depression.

Report

### US Apiculture and Honey Production

Matthew Ruehle and Sean CarterWe examine the historic honey-producing bee colony counts, yield, and honey production of states as collected by the USDA, finding statistical evidence for regional "clustering" of production and a negative correlation between per-hive yield and overall price, most strongly reflected in states with the greatest absolute production

Report

### Most Terrorism is Local

Radmer van der Heyde
I explored the Global Terrorism Database to see how terror has evolved over time, and whether international terrorism has any defining features. Over time terrorism has increased and gotten deadlier, but shifted regions. However, international terrorism was too small a percentage of the dataset to reach an appropriate conclusion.

### Does it get warmer before it rains?

Apurva Raman and William Lu

Speculating about the weather has been a staple of small talk and human curiosity for a long time, and as a result, many weather "myths" exist. One such myth we’ve heard is that it gets warmer before a precipitation event (e.g. rain, snow, hail, sleet, etc.) occurs. Using data from the US National Oceanic and Atmospheric Association (NOAA), we find that change in temperature is a poor indicator for whether or not there will be a precipitation event.

## Wednesday, March 8, 2017

### Money, Murder, the Midwest, and More

In my Data Science class this semester, students are working on a series of reports where they explore a freely-available dataset, use data to answer questions, and present their findings. After each batch of reports, I will publish the abstracts here; you can follow the links below to see what they found.

As tensions over immigration increase with Europe dealing with a huge influx of refugees, some countries are more ready to accept immigrants while others close their borders to them. To understand the opinions of Europeans on immigration of particular groups, we investigated if respondents from different countries in Europe have consistent opinions toward Jews, Muslims, and Gypsies. We found that countries with a strong preference against Jews or Gypsies will also not prefer the other group. This does not hold true for Jews; countries that are willing to allow Jews are not necessarily willing to allow Muslims or Gypsies. Countries that are not accepting of Jews are not accepting of any of these three groups. However, they all preferred Jews to Muslims and Muslims to Gypsies.

Report

It is often rumored that colleges in the Midwest prefer ACT scores while colleges in other regions prefer SAT Scores. The goal of this article is to explore the relationship between SAT and ACT scores in the Midwest and other regions. The data used was collected from the US Department of Education for the years 2014-15. Comparing just the means of ACT scores shows that the Midwest scores slightly higher on average: 23.48 vs 23.17. However, a better statistic might be to compare the ratio of ACT/SAT scores. The Midwest has a slightly higher ratio (0.969) than other regions (0.960). Although we cannot deduce any causation, we can draw inferences as to what causes these differences. One explanation might be the fact that students applying to Midwestern colleges spend more time studying for the ACT.

Report

With issues like a growing wage gap, racism, and feminism at the front of our nation's attention, it can seem that the wealthy only care about getting more wealth, while equality only matters to those who are disadvantaged. However, the results of the European Social Survey of 2014 suggests that those with money do not value wealth a significant amount more than those of lower income brackets, and equality is not only valued at the same level across income brackets, but is consistently rated as more important than wealth.

Report

In the political arena, liberals often call their conservative counterparts "ignorant" because they believe that the other party doesn't know the facts, that they just don't know what's going on in the world. This would suggest that being more informed on political news and current events would make one more liberal. But does it really matter though? Does being more informed about politics make a person more liberal? Does it matter at all on how people end up voting? This article will decide whether being an informed individual truly results in believing a more liberal platform, or whether this notion is just a mislead stereotype meant as a mudsling tactic. Data analytics show that apparently a person has a high chance of holding the same opinion regardless of whether they are informed individuals or not. However, it seems that rather than leaning towards liberals, being more informed has the potential to make people more polarized towards either side and have stronger opinions on various political topics in general. While being more informed might not lead an increase in liberal thoughts, it might very well make people better able to cast a more thoughtful and representative vote.

Report

##
Money

Sungwoo Park

Does money buy you happiness? It's a decades old question that people have been wondering about. Data from the General Social Survey on the respondent's income and happiness level seem to suggest that people with high income tend to be happier than people with low income. Also, the data show that people with high income value the feeling of accomplishment and the importance of the job in their work more than people with low income do.

Report

The motivating question was to find out whether or not there existed a connection between the salary of an NBA player and his performance in the league. Using the statistic Player Efficiency Rating (PER), an NBA statistic commonly used to measure a player's overall performance in the league, I compared player salaries and performances. With a correlation of 0.5 between salaries and PERs across the leauge, as well as a Spearman Correlation of 0.4, I came to the conclusion that there was a slight correlation between the two variables, and thus higher paid NBA players may be deserving of their paychecks.

Report

We examine the claims made in an Economist article on prison tattoos. Examining a publicly-available inmate database, we found that there are several noticeable trends between tattoos and types of criminal conviction. Our results are not necessarily causative, and may reflect either societal biases or demographic trends. Nonetheless, the data demonstrates a strong correlation between different categories of "ink" and criminal classifications.

Report

As costs to attend college increase, an increasing number of high school seniors are left wondering if they should or must select a more affordable college. Many Americans go to college not just to gain a higher education, but also to increase their earning potential later in life. Using US Department of Education College Scorecard data, I found that going to a more expensive college could potentially make you more money in the future, that more selective colleges don't necessarily cost more, and that more selective colleges don't necessarily make you more money in the future.

Report

In this report, I sought to answer the question: does heart disease have seasonality like that of Influenza? To answer this, I explored the CDC's Wonder database on the underlying causes of death on the monthly data for the state of California. Based on my results, the majority of heart diseases show some seasonality as the dominant frequency component is at the frequency corresponding to a period of 1 year.

Report

## How do Europeans feel about Jewish, Muslim, and Gypsy immigration?

Apurva Raman and Celina BekinsAs tensions over immigration increase with Europe dealing with a huge influx of refugees, some countries are more ready to accept immigrants while others close their borders to them. To understand the opinions of Europeans on immigration of particular groups, we investigated if respondents from different countries in Europe have consistent opinions toward Jews, Muslims, and Gypsies. We found that countries with a strong preference against Jews or Gypsies will also not prefer the other group. This does not hold true for Jews; countries that are willing to allow Jews are not necessarily willing to allow Muslims or Gypsies. Countries that are not accepting of Jews are not accepting of any of these three groups. However, they all preferred Jews to Muslims and Muslims to Gypsies.

Report

## Do Midwestern colleges have better ACT scores?

David PappIt is often rumored that colleges in the Midwest prefer ACT scores while colleges in other regions prefer SAT Scores. The goal of this article is to explore the relationship between SAT and ACT scores in the Midwest and other regions. The data used was collected from the US Department of Education for the years 2014-15. Comparing just the means of ACT scores shows that the Midwest scores slightly higher on average: 23.48 vs 23.17. However, a better statistic might be to compare the ratio of ACT/SAT scores. The Midwest has a slightly higher ratio (0.969) than other regions (0.960). Although we cannot deduce any causation, we can draw inferences as to what causes these differences. One explanation might be the fact that students applying to Midwestern colleges spend more time studying for the ACT.

Report

## Rich or Poor: To Whom does it Matter More?

Kaitlyn KeilWith issues like a growing wage gap, racism, and feminism at the front of our nation's attention, it can seem that the wealthy only care about getting more wealth, while equality only matters to those who are disadvantaged. However, the results of the European Social Survey of 2014 suggests that those with money do not value wealth a significant amount more than those of lower income brackets, and equality is not only valued at the same level across income brackets, but is consistently rated as more important than wealth.

Report

## Do More Politically Informed People Identify as Liberal?

Kevin ZhangIn the political arena, liberals often call their conservative counterparts "ignorant" because they believe that the other party doesn't know the facts, that they just don't know what's going on in the world. This would suggest that being more informed on political news and current events would make one more liberal. But does it really matter though? Does being more informed about politics make a person more liberal? Does it matter at all on how people end up voting? This article will decide whether being an informed individual truly results in believing a more liberal platform, or whether this notion is just a mislead stereotype meant as a mudsling tactic. Data analytics show that apparently a person has a high chance of holding the same opinion regardless of whether they are informed individuals or not. However, it seems that rather than leaning towards liberals, being more informed has the potential to make people more polarized towards either side and have stronger opinions on various political topics in general. While being more informed might not lead an increase in liberal thoughts, it might very well make people better able to cast a more thoughtful and representative vote.

Report

##
Money *might* buy you some happiness

Sungwoo Park
Does money buy you happiness? It's a decades old question that people have been wondering about. Data from the General Social Survey on the respondent's income and happiness level seem to suggest that people with high income tend to be happier than people with low income. Also, the data show that people with high income value the feeling of accomplishment and the importance of the job in their work more than people with low income do.

Report

## Higher paid NBA players are (probably) deserving

Willem Thorbecke

Report

## Murder, Ink — A statistical analysis of tattoos in the Florida prison system

Joey Maalouf, Matthew Ruehle, Sean CarterWe examine the claims made in an Economist article on prison tattoos. Examining a publicly-available inmate database, we found that there are several noticeable trends between tattoos and types of criminal conviction. Our results are not necessarily causative, and may reflect either societal biases or demographic trends. Nonetheless, the data demonstrates a strong correlation between different categories of "ink" and criminal classifications.

Report

## Are more selective or expensive colleges worth it?

William LuAs costs to attend college increase, an increasing number of high school seniors are left wondering if they should or must select a more affordable college. Many Americans go to college not just to gain a higher education, but also to increase their earning potential later in life. Using US Department of Education College Scorecard data, I found that going to a more expensive college could potentially make you more money in the future, that more selective colleges don't necessarily cost more, and that more selective colleges don't necessarily make you more money in the future.

Report

## Are Diseases of the Heart Seasonal?

Radmer van der HeydeIn this report, I sought to answer the question: does heart disease have seasonality like that of Influenza? To answer this, I explored the CDC's Wonder database on the underlying causes of death on the monthly data for the state of California. Based on my results, the majority of heart diseases show some seasonality as the dominant frequency component is at the frequency corresponding to a period of 1 year.

Report

## Thursday, February 16, 2017

### A nice Bayes theorem problem: medical testing

On these previous post about my favorite Bayes theorem problems, I got the following comment from a reader named Riya:

I think this is an excellent question, so I am passing it along to the readers of this blog. One suggestion: you might want to use my world famous Bayesian update worksheet.

Hint: This question is similar to one I wrote about last year. In that article, I started with a problem that was underspecified; it took a while for me to realize that there were several ways to formulate the problem, with different answers.

Fortunately, the problem posed by Riya is completely specified; it is an example of what I called Scenario A, where there are two tests with different properties, and we don't know which test was used.

There are several ways to proceed, but I recommend writing four hypotheses that specify the test and the status of the patient:

TEST1 and sick

TEST1 and not sick

TEST2 and sick

TEST2 and not sick

For each of these hypotheses, it is straightforward to compute the prior probability and the likelihood of a positive test. From there, it's just arithmetic.

Here's what it looks like using my world famous Bayesian update worksheet:

(Now with more smudges because I had an arithmetic error the first time. Thanks, Ben Torvaney, for pointing it out.)

I have a question. Exactly 1/5th of the people in a town have Beaver Fever . There are two tests for Beaver Fever, TEST1 and TEST2. When a person goes to a doctor to test for Beaver Fever, with probability 2/3 the doctor conducts TEST1 on him and with probability 1/3 the doctor conducts TEST2 on him. When TEST1 is done on a person, the outcome is as follows: If the person has the disease, the result is positive with probability 3/4. If the person does not have the disease, the result is positive with probability 1/4. When TEST2 is done on a person, the outcome is as follows: If the person has the disease, the result is positive with probability 1. If the person does not have the disease, the result is positive with probability 1/2. A person is picked uniformly at random from the town and is sent to a doctor to test for Beaver Fever. The result comes out positive. What is the probability that the person has the disease?

I think this is an excellent question, so I am passing it along to the readers of this blog. One suggestion: you might want to use my world famous Bayesian update worksheet.

Hint: This question is similar to one I wrote about last year. In that article, I started with a problem that was underspecified; it took a while for me to realize that there were several ways to formulate the problem, with different answers.

Fortunately, the problem posed by Riya is completely specified; it is an example of what I called Scenario A, where there are two tests with different properties, and we don't know which test was used.

There are several ways to proceed, but I recommend writing four hypotheses that specify the test and the status of the patient:

TEST1 and sick

TEST1 and not sick

TEST2 and sick

TEST2 and not sick

For each of these hypotheses, it is straightforward to compute the prior probability and the likelihood of a positive test. From there, it's just arithmetic.

Here's what it looks like using my world famous Bayesian update worksheet:

After the update, the total probability that the patient is sick is 10/26 or about 38%. That's up from the prior, which was 1/5 or 20%. So the positive test is evidence that the patient is sick, but it is not very strong evidence.

Interestingly, the total posterior probability of TEST2 is 12/26 or about 46%. That's up from the prior, which was 33%. So the positive test provides some evidence that TEST2 was used.

## Monday, January 16, 2017

### Last batch of notebooks for Think Stats

Getting ready to teach Data Science in the spring, I am going back through

If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer.

Or you can read (but not run) the notebooks on GitHub:

Chapter 13 Notebook (Chapter 13 Solutions)

Chapter 14 Notebook (Chapter 14 Solutions)

I am done now, just in time for the semester to start, tomorrow! Here are some of the examples from Chapter 13, on survival analysis:

Here's the distribution of pregnancy length in the NSFG dataset.

Compute the duration of marriages that have ended in divorce, and the duration, so far, of marriages that are ongoing. Estimate the hazard and survival curve for the duration of marriage.

Use resampling to take into account sampling weights, and plot data from several resamples to visualize sampling error.

Consider dividing the respondents into groups by decade of birth, and possibly by age at first marriage.

*Think Stats*and updating the Jupyter notebooks. Each chapter has a notebook that shows the examples from the book along with some small exercises, with more substantial exercises at the end.If you are reading the book, you can get the notebooks by cloning this repository on GitHub, and running the notebooks on your computer.

Or you can read (but not run) the notebooks on GitHub:

Chapter 13 Notebook (Chapter 13 Solutions)

Chapter 14 Notebook (Chapter 14 Solutions)

I am done now, just in time for the semester to start, tomorrow! Here are some of the examples from Chapter 13, on survival analysis:

## Survival analysis¶

If we have an unbiased sample of complete lifetimes, we can compute the survival function from the CDF and the hazard function from the survival function.Here's the distribution of pregnancy length in the NSFG dataset.

In [2]:

```
import nsfg
preg = nsfg.ReadFemPreg()
complete = preg.query('outcome in [1, 3, 4]').prglngth
cdf = thinkstats2.Cdf(complete, label='cdf')
```

The survival function is just the complementary CDF.

In [3]:

```
import survival
def MakeSurvivalFromCdf(cdf, label=''):
"""Makes a survival function based on a CDF.
cdf: Cdf
returns: SurvivalFunction
"""
ts = cdf.xs
ss = 1 - cdf.ps
return survival.SurvivalFunction(ts, ss, label)
```

In [4]:

```
sf = MakeSurvivalFromCdf(cdf, label='survival')
```

In [5]:

```
print(cdf[13])
print(sf[13])
```

0.13978014121 0.86021985879

Here's the CDF and SF.

In [6]:

```
thinkplot.Plot(sf)
thinkplot.Cdf(cdf, alpha=0.2)
thinkplot.Config(loc='center left')
```

And here's the hazard function.

In [7]:

```
hf = sf.MakeHazardFunction(label='hazard')
print(hf[39])
```

0.676706827309

In [8]:

```
thinkplot.Plot(hf)
thinkplot.Config(ylim=[0, 0.75], loc='upper left')
```

## Age at first marriage¶

We'll use the NSFG respondent file to estimate the hazard function and survival function for age at first marriage.
In [9]:

```
resp6 = nsfg.ReadFemResp()
```

We have to clean up a few variables.

In [10]:

```
resp6.cmmarrhx.replace([9997, 9998, 9999], np.nan, inplace=True)
resp6['agemarry'] = (resp6.cmmarrhx - resp6.cmbirth) / 12.0
resp6['age'] = (resp6.cmintvw - resp6.cmbirth) / 12.0
```

And the extract the age at first marriage for people who are married, and the age at time of interview for people who are not.

In [11]:

```
complete = resp6[resp6.evrmarry==1].agemarry.dropna()
ongoing = resp6[resp6.evrmarry==0].age
```

The following function uses Kaplan-Meier to estimate the hazard function.

In [12]:

```
from collections import Counter
def EstimateHazardFunction(complete, ongoing, label='', verbose=False):
"""Estimates the hazard function by Kaplan-Meier.
http://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
complete: list of complete lifetimes
ongoing: list of ongoing lifetimes
label: string
verbose: whether to display intermediate results
"""
if np.sum(np.isnan(complete)):
raise ValueError("complete contains NaNs")
if np.sum(np.isnan(ongoing)):
raise ValueError("ongoing contains NaNs")
hist_complete = Counter(complete)
hist_ongoing = Counter(ongoing)
ts = list(hist_complete | hist_ongoing)
ts.sort()
at_risk = len(complete) + len(ongoing)
lams = pd.Series(index=ts)
for t in ts:
ended = hist_complete[t]
censored = hist_ongoing[t]
lams[t] = ended / at_risk
if verbose:
print(t, at_risk, ended, censored, lams[t])
at_risk -= ended + censored
return survival.HazardFunction(lams, label=label)
```

Here is the hazard function and corresponding survival function.

In [13]:

```
hf = EstimateHazardFunction(complete, ongoing)
thinkplot.Plot(hf)
thinkplot.Config(xlabel='Age (years)',
ylabel='Hazard')
```

In [14]:

```
sf = hf.MakeSurvival()
thinkplot.Plot(sf)
thinkplot.Config(xlabel='Age (years)',
ylabel='Prob unmarried',
ylim=[0, 1])
```

## Quantifying uncertainty¶

To see how much the results depend on random sampling, we'll use a resampling process again.
In [15]:

```
def EstimateMarriageSurvival(resp):
"""Estimates the survival curve.
resp: DataFrame of respondents
returns: pair of HazardFunction, SurvivalFunction
"""
# NOTE: Filling missing values would be better than dropping them.
complete = resp[resp.evrmarry == 1].agemarry.dropna()
ongoing = resp[resp.evrmarry == 0].age
hf = EstimateHazardFunction(complete, ongoing)
sf = hf.MakeSurvival()
return hf, sf
```

In [16]:

```
def ResampleSurvival(resp, iters=101):
"""Resamples respondents and estimates the survival function.
resp: DataFrame of respondents
iters: number of resamples
"""
_, sf = EstimateMarriageSurvival(resp)
thinkplot.Plot(sf)
low, high = resp.agemarry.min(), resp.agemarry.max()
ts = np.arange(low, high, 1/12.0)
ss_seq = []
for _ in range(iters):
sample = thinkstats2.ResampleRowsWeighted(resp)
_, sf = EstimateMarriageSurvival(sample)
ss_seq.append(sf.Probs(ts))
low, high = thinkstats2.PercentileRows(ss_seq, [5, 95])
thinkplot.FillBetween(ts, low, high, color='gray', label='90% CI')
```

The following plot shows the survival function based on the raw data and a 90% CI based on resampling.

In [17]:

```
ResampleSurvival(resp6)
thinkplot.Config(xlabel='Age (years)',
ylabel='Prob unmarried',
xlim=[12, 46],
ylim=[0, 1],
loc='upper right')
```

The SF based on the raw data falls outside the 90% CI because the CI is based on weighted resampling, and the raw data is not. You can confirm that by replacing

`ResampleRowsWeighted`

with `ResampleRows`

in `ResampleSurvival`

.## More data¶

To generate survivial curves for each birth cohort, we need more data, which we can get by combining data from several NSFG cycles.
In [18]:

```
resp5 = survival.ReadFemResp1995()
resp6 = survival.ReadFemResp2002()
resp7 = survival.ReadFemResp2010()
```

In [19]:

```
resps = [resp5, resp6, resp7]
```

The following is the code from

`survival.py`

that generates SFs broken down by decade of birth.
In [20]:

```
def AddLabelsByDecade(groups, **options):
"""Draws fake points in order to add labels to the legend.
groups: GroupBy object
"""
thinkplot.PrePlot(len(groups))
for name, _ in groups:
label = '%d0s' % name
thinkplot.Plot([15], [1], label=label, **options)
def EstimateMarriageSurvivalByDecade(groups, **options):
"""Groups respondents by decade and plots survival curves.
groups: GroupBy object
"""
thinkplot.PrePlot(len(groups))
for _, group in groups:
_, sf = EstimateMarriageSurvival(group)
thinkplot.Plot(sf, **options)
def PlotResampledByDecade(resps, iters=11, predict_flag=False, omit=None):
"""Plots survival curves for resampled data.
resps: list of DataFrames
iters: number of resamples to plot
predict_flag: whether to also plot predictions
"""
for i in range(iters):
samples = [thinkstats2.ResampleRowsWeighted(resp)
for resp in resps]
sample = pd.concat(samples, ignore_index=True)
groups = sample.groupby('decade')
if omit:
groups = [(name, group) for name, group in groups
if name not in omit]
# TODO: refactor this to collect resampled estimates and
# plot shaded areas
if i == 0:
AddLabelsByDecade(groups, alpha=0.7)
if predict_flag:
PlotPredictionsByDecade(groups, alpha=0.1)
EstimateMarriageSurvivalByDecade(groups, alpha=0.1)
else:
EstimateMarriageSurvivalByDecade(groups, alpha=0.2)
```

Here are the results for the combined data.

In [21]:

```
PlotResampledByDecade(resps)
thinkplot.Config(xlabel='Age (years)',
ylabel='Prob unmarried',
xlim=[13, 45],
ylim=[0, 1])
```

We can generate predictions by assuming that the hazard function of each generation will be the same as for the previous generation.

In [22]:

```
def PlotPredictionsByDecade(groups, **options):
"""Groups respondents by decade and plots survival curves.
groups: GroupBy object
"""
hfs = []
for _, group in groups:
hf, sf = EstimateMarriageSurvival(group)
hfs.append(hf)
thinkplot.PrePlot(len(hfs))
for i, hf in enumerate(hfs):
if i > 0:
hf.Extend(hfs[i-1])
sf = hf.MakeSurvival()
thinkplot.Plot(sf, **options)
```

And here's what that looks like.

In [23]:

```
PlotResampledByDecade(resps, predict_flag=True)
thinkplot.Config(xlabel='Age (years)',
ylabel='Prob unmarried',
xlim=[13, 45],
ylim=[0, 1])
```

## Remaining lifetime¶

Distributions with difference shapes yield different behavior for remaining lifetime as a function of age.
In [24]:

```
preg = nsfg.ReadFemPreg()
complete = preg.query('outcome in [1, 3, 4]').prglngth
print('Number of complete pregnancies', len(complete))
ongoing = preg[preg.outcome == 6].prglngth
print('Number of ongoing pregnancies', len(ongoing))
hf = EstimateHazardFunction(complete, ongoing)
sf1 = hf.MakeSurvival()
```

Number of complete pregnancies 11189 Number of ongoing pregnancies 352

Here's the expected remaining duration of a pregnancy as a function of the number of weeks elapsed. After week 36, the process becomes "memoryless".

In [25]:

```
rem_life1 = sf1.RemainingLifetime()
thinkplot.Plot(rem_life1)
thinkplot.Config(title='Remaining pregnancy length',
xlabel='Weeks',
ylabel='Mean remaining weeks')
```

And here's the median remaining time until first marriage as a function of age.

In [26]:

```
hf, sf2 = EstimateMarriageSurvival(resp6)
```

In [27]:

```
func = lambda pmf: pmf.Percentile(50)
rem_life2 = sf2.RemainingLifetime(filler=np.inf, func=func)
thinkplot.Plot(rem_life2)
thinkplot.Config(title='Years until first marriage',
ylim=[0, 15],
xlim=[11, 31],
xlabel='Age (years)',
ylabel='Median remaining years')
```

## Exercises¶

**Exercise:**In NSFG Cycles 6 and 7, the variable

`cmdivorcx`

contains the date of divorce for the respondent’s first marriage, if applicable, encoded in century-months.Compute the duration of marriages that have ended in divorce, and the duration, so far, of marriages that are ongoing. Estimate the hazard and survival curve for the duration of marriage.

Use resampling to take into account sampling weights, and plot data from several resamples to visualize sampling error.

Consider dividing the respondents into groups by decade of birth, and possibly by age at first marriage.

In [28]:

```
def CleanData(resp):
"""Cleans respondent data.
resp: DataFrame
"""
resp.cmdivorcx.replace([9998, 9999], np.nan, inplace=True)
resp['notdivorced'] = resp.cmdivorcx.isnull().astype(int)
resp['duration'] = (resp.cmdivorcx - resp.cmmarrhx) / 12.0
resp['durationsofar'] = (resp.cmintvw - resp.cmmarrhx) / 12.0
month0 = pd.to_datetime('1899-12-15')
dates = [month0 + pd.DateOffset(months=cm)
for cm in resp.cmbirth]
resp['decade'] = (pd.DatetimeIndex(dates).year - 1900) // 10
```

In [29]:

```
CleanData(resp6)
married6 = resp6[resp6.evrmarry==1]
CleanData(resp7)
married7 = resp7[resp7.evrmarry==1]
```

In [30]:

```
# Solution
def ResampleDivorceCurve(resps):
"""Plots divorce curves based on resampled data.
resps: list of respondent DataFrames
"""
for _ in range(11):
samples = [thinkstats2.ResampleRowsWeighted(resp)
for resp in resps]
sample = pd.concat(samples, ignore_index=True)
PlotDivorceCurveByDecade(sample, color='#225EA8', alpha=0.1)
thinkplot.Show(xlabel='years',
axis=[0, 28, 0, 1])
```

In [31]:

```
# Solution
def ResampleDivorceCurveByDecade(resps):
"""Plots divorce curves for each birth cohort.
resps: list of respondent DataFrames
"""
for i in range(41):
samples = [thinkstats2.ResampleRowsWeighted(resp)
for resp in resps]
sample = pd.concat(samples, ignore_index=True)
groups = sample.groupby('decade')
if i == 0:
survival.AddLabelsByDecade(groups, alpha=0.7)
EstimateSurvivalByDecade(groups, alpha=0.1)
thinkplot.Config(xlabel='Years',
ylabel='Fraction undivorced',
axis=[0, 28, 0, 1])
```

In [32]:

```
# Solution
def EstimateSurvivalByDecade(groups, **options):
"""Groups respondents by decade and plots survival curves.
groups: GroupBy object
"""
thinkplot.PrePlot(len(groups))
for name, group in groups:
_, sf = EstimateSurvival(group)
thinkplot.Plot(sf, **options)
```

In [33]:

```
# Solution
def EstimateSurvival(resp):
"""Estimates the survival curve.
resp: DataFrame of respondents
returns: pair of HazardFunction, SurvivalFunction
"""
complete = resp[resp.notdivorced == 0].duration.dropna()
ongoing = resp[resp.notdivorced == 1].durationsofar.dropna()
hf = survival.EstimateHazardFunction(complete, ongoing)
sf = hf.MakeSurvival()
return hf, sf
```

In [34]:

```
# Solution
ResampleDivorceCurveByDecade([married6, married7])
```

In [ ]:

```
```

Subscribe to:
Posts (Atom)