The Ashley Madison hack continues to unfold, as so many of these stories do, with thousands of journalists and other interested parties sorting the data.
The data itself—today's new data dump excepted—is not very complicated. There is a member database showing anyone who has ever signed up for the service and then there are daily transaction records from a corporate server. The latter data tracks paying users, the people who gave money to the site so that they could send messages. (Receiving messages is free.) We focused on these customers because we figured these were the people who were serious about using the site.
We had a simple question: Were people in some states more likely to pay for Ashley Madison than people in other states? Before we go into the methodology, let's just be clear that there were wide variations between states.
So who was on top as the Ashley Madisoniest state? Well, I hate to say you'd expect this but… It's Jersey. The Garden State is followed by our nation's capital (of course), and Connecticut. Massachusetts, Colorado, New Hampshire, Virginia, Utah, New York, and Maryland round out your top 10.
I see you there Utah. I see you.
And here are the least Ashley Madisoniest states, from the bottom of the ranking: West Virginia, Mississippi, Arkansas, Maine, Kentucky, Iowa, Tennessee, Alabama, South Dakota. Gotta say: lot of red states in that list.
But, perhaps more importantly, there are a lot of poor states on the list, too. West Virginia, Mississippi, Arkansas, Kentucky, and Alabama rank among the poorest states in the country, year in and year out. And disposable income has got to play some role in how likely a person is to use a paid service to seek an affair.
It's worth noting that the variations between states are quite significant from top to bottom. We had unique IDs for 0.82% of New Jersey's over-18 population. Almost 1 percent. For the median state, which of course is Nebraska, you're looking at 0.49%. And down at West Virginia, we're talking 0.28%. So based on this data, a New Jersey resident was almost 3 times as likely to pay for Ashley Madison as someone from West Virginia.
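For anyone who wants to check the "almost 3 times" arithmetic, here is a quick sketch in Python. The three percentages are the ones quoted above; the variable names are just ours for illustration.

```python
# Share of each state's over-18 population with a unique paying ID,
# in percent, as quoted in the article.
share = {"New Jersey": 0.82, "Nebraska": 0.49, "West Virginia": 0.28}

# How many times more common paying customers are in the top state
# than in the bottom state and in the median state.
ratio_top_bottom = share["New Jersey"] / share["West Virginia"]
ratio_top_median = share["New Jersey"] / share["Nebraska"]

print(f"New Jersey vs. West Virginia: {ratio_top_bottom:.2f}x")
print(f"New Jersey vs. Nebraska: {ratio_top_median:.2f}x")
```

The first ratio comes out a little under 3, which is where "almost 3 times" comes from.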
How did we do these calculations and make the map? It wasn't that hard, but it took some time. All of the transaction data is very similar and amenable to machine manipulation. With the credit card transactions in particular, each row of data consists of several transaction tracking numbers, a name, the last four digits of a credit card, and an address.
But there are several thousand daily documents, each one containing several thousand records. That's millions of rows of data. Add it all up and we're talking a *text file* that is more than a couple gigabytes. So many millions that the data takes on almost physical qualities—it's easier to move by thumb drive than across the Internet, and doing things with it can take a while on the human time scale. It's not the kind of thing you can drop into Excel and simply start combing through.
So, here's what we did. First, we concatenated all the individual transaction files into one big file that we could manipulate (alldata.csv). Then we (or rather Fusion's Daniel McLaughlin) wrote a Python script that created a ranked list of states by the number of transactions in the database. But what we were really after was the number of people, so we de-duplicated the data based on names and the last four digits of the credit card number. That let us isolate the number of unique people represented in the cache of paying customers.
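To make the counting and de-duplication step concrete, here is a minimal sketch of how it could look in Python. This is not the actual script. The column names (`name`, `last4`, `state`), the sample rows, and the choice to lowercase names before comparing are all assumptions made for illustration, and the sketch de-dupes within each state.

```python
import csv
import io
from collections import Counter


def count_by_state(rows):
    """Return (transactions per state, unique payers per state).

    Each row is a dict with hypothetical keys: name, last4, state.
    """
    transactions = Counter()
    payers_by_state = {}  # state -> set of (name, last4) keys
    for row in rows:
        state = row["state"].strip().upper()
        # Assumption: a person is identified by name plus last four digits.
        key = (row["name"].strip().lower(), row["last4"].strip())
        transactions[state] += 1
        payers_by_state.setdefault(state, set()).add(key)
    people = Counter({s: len(keys) for s, keys in payers_by_state.items()})
    return transactions, people


# Made-up sample standing in for the real alldata.csv.
SAMPLE = """name,last4,state
Pat Example,1234,NJ
Pat Example,1234,NJ
pat example,1234,NJ
Sam Sample,9876,NJ
Pat Example,1234,NY
"""

transactions, people = count_by_state(csv.DictReader(io.StringIO(SAMPLE)))
print("By transactions:", transactions.most_common())
print("By unique people:", people.most_common())
```

Because `csv.DictReader` reads one row at a time, the same function could be pointed at an open multi-gigabyte file without loading all of it into memory. Only the sets of unique keys have to fit.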
But, of course, the states with the most people in the database were just the biggest states — California, Texas, New York, and Florida. So, we took the over-18 populations of the 50 states and the District of Columbia and divided our number of Ashley Madison people by the total adult population of each state to arrive at a per-capita number. FWIW, there turned out to be roughly 5.6 payments per person in the data with some variation between states (min: 4.9, max: 6.5).
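The per-capita step is just a division, but it changes the ranking completely. Here is a sketch with two invented states. Every number below is made up for illustration and none of it comes from the data set or from census figures.

```python
def per_capita_ranking(unique_payers, adult_population):
    """Rank states by unique payers as a percent of over-18 population."""
    rates = {
        state: 100.0 * unique_payers[state] / adult_population[state]
        for state in unique_payers
    }
    return sorted(rates.items(), key=lambda item: item[1], reverse=True)


# Hypothetical inputs: a big state and a small state.
unique_payers = {"Big": 50_000, "Small": 8_000}
adult_population = {"Big": 10_000_000, "Small": 1_000_000}
transactions = {"Big": 280_000, "Small": 40_000}

ranking = per_capita_ranking(unique_payers, adult_population)

# Payments per person: total transactions over unique payers.
payments_per_person = {
    state: transactions[state] / unique_payers[state] for state in unique_payers
}

for state, rate in ranking:
    print(f"{state}: {rate:.2f}% of adults, "
          f"{payments_per_person[state]:.1f} payments per person")
```

In this toy example the big state has far more paying customers in raw terms, but the small state comes out on top once you divide by adult population.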
Having seen a lot of this data firsthand, I would not say this is the cleanest data set in the world. We know a few sources of error. One, we de-duped on a state-by-state basis, so there are probably some users who paid from different states, and therefore are showing up in two states' counts here. Two, many people paid with gift cards, and so their addresses could be completely false. Three, there are clearly a lot of made-up addresses in the data.
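The first of those error sources could be sized up with a check like the one below. This is a sketch, not something we ran. The records are invented, and it assumes that the same name and last four digits in two states is one person, which will not always be true.

```python
from collections import defaultdict


def multi_state_payers(records):
    """Return payers whose (name, last4) key appears in more than one state."""
    states_by_payer = defaultdict(set)
    for name, last4, state in records:
        states_by_payer[(name, last4)].add(state)
    return {p: s for p, s in states_by_payer.items() if len(s) > 1}


# Invented records: (name, last four digits, state).
records = [
    ("Pat Example", "1234", "NJ"),
    ("Pat Example", "1234", "NY"),
    ("Sam Sample", "9876", "NJ"),
]

overlap = multi_state_payers(records)

# Sum of per-state unique counts versus a single nationwide de-dupe.
per_state_total = len({(name, last4, state) for name, last4, state in records})
national_total = len({(name, last4) for name, last4, _ in records})

print("Payers seen in more than one state:", len(overlap))
print("Per-state total:", per_state_total, "| nationwide total:", national_total)
```

The gap between the two totals is the number of extra counts that state-by-state de-duping introduces.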
Beyond the state map, the first thing that stands out in this data is the relatively small number of people who appear in the paying records. By our method, we got 1.3 million unique American paying customers stretching back all the way to 2008. But all kinds of stories have cited 37 million users for the site. So, the site clearly has many non-paying users (who wouldn't be included in our credit card transaction data). Only one side of a conversation on the site has to pay, and we've heard that women, for example, basically used the site for free. But it may also mean that the vast majority of users just created an account to see what a site for cheaters looked like and never used it or even intended to.