Take that Data with a Grain of Salt

I wanted to geographically visualize Twitter activity by zip code. I had a set of Tweets captured from GNIP Powertrack. Each tweet has the author’s public profile, as well as the tweet text and several other fields.

First, I got the zip code data: about 42,500 zip codes, each with lat/lon coordinates and a population estimate from the 2010 US Census. Next, I loaded it into a Hive table. I used Hive because the tweet data was big enough to live in Hadoop, though you could do the geo zip code manipulation I’m describing in any database.
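For reference, the Hive table can be as simple as the sketch below. The column names and the file path are placeholders rather than the real layout; zip is a string because zip codes can have leading zeros.

```sql
-- Placeholder DDL for the zip code table; real column names and file layout will differ.
CREATE TABLE zips (
  zip        STRING,   -- string, since zip codes can have leading zeros
  lat        DOUBLE,
  lon        DOUBLE,
  population BIGINT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Hypothetical HDFS path for the census file.
LOAD DATA INPATH '/data/zipcodes_2010.csv' INTO TABLE zips;
```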

Next, from the Twitter data I found the closest zip code to each author by computing the distance from the author’s lat/lon to the lat/lon of every zip code. Of course, this only works for people who put their location in their Twitter public profile, but that was surprisingly many people. And since I only needed the closest zip, not the actual distance, units didn’t matter, which simplified the computation.
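The distance computation itself is short in HiveQL. The tweets table and its columns below are illustrative placeholders, not the real schema; picking the per-author minimum from this table is the same min-distance pattern that comes up again below when merging zip codes.

```sql
-- Hypothetical tweets table: one row per tweet, with the author's profile
-- location already geocoded to author_lat/author_lon.
-- Only relative distances matter here, so plain squared distance in degrees
-- is enough to decide which zip code is closest; no haversine or unit
-- conversion needed.
CREATE TABLE author_zip_dist AS
SELECT t.author_id,
       z.zip,
       (t.author_lat - z.lat) * (t.author_lat - z.lat) +
       (t.author_lon - z.lon) * (t.author_lon - z.lon) AS dist
FROM tweets t
CROSS JOIN zips z;
```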

Next, I computed metrics (like tweet counts) per zip code. Raw numbers would not be useful – of course most of the tweets come from the zips with the most people. I divided by zip code population to get per-capita values.
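A sketch of that per-capita rollup, assuming the previous step has been boiled down to a hypothetical tweets_with_zip table with one row per tweet and an author_zip column:

```sql
-- tweets_with_zip is a hypothetical table: one row per tweet, tagged with the
-- author's nearest zip code. Raw counts just mirror population, so normalize
-- to tweets per capita.
SELECT z.zip,
       COUNT(*) / z.population AS tweets_per_capita
FROM tweets_with_zip t
JOIN zips z ON t.author_zip = z.zip
GROUP BY z.zip, z.population;
```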

At this point I should have been done, in which case it would have been so simple and quick there would be no reason to write about it here. However, that’s not what happened. The values were all over the map, spanning 5 orders of magnitude (0 – 10^5). Per capita?! This didn’t make sense. I expected most values to be zero or close to it, with reasonable maximums being a fraction of 1 – nowhere near tens of thousands.

Investigation revealed that the outliers were tiny zip codes like 30334, covering about one square block and with tiny populations like 1. Whenever a prolific tweeter (or tweet-bot) resides in one of these, it blows out the data. Clearly, these are zip codes too new to have accurate 2010 Census data.

The problem is actually more general: any small zip code can blow out the data, even if its population data is accurate. I needed a way to factor out these tiny zip codes. One way to do this is to merge each small zip code into its nearest big zip code: split the zip codes into two groups, small and big, using a population threshold. I made a histogram of zip codes by population to see what a good threshold would be. I chose a 25/75 split, since 25% is about 10,000 zip codes, which is enough for smooth, dense coverage. The histogram showed that a population threshold of about 8,000 would give this 25/75 split.
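A quick way to eyeball that histogram and then apply the threshold, again against the placeholder zips table from earlier (the 1,000-wide buckets are an arbitrary choice for the sketch):

```sql
-- Bucket zip codes by population and count, to eyeball the distribution.
SELECT pop_bucket, COUNT(*) AS num_zips
FROM (
  SELECT FLOOR(population / 1000) * 1000 AS pop_bucket
  FROM zips
) b
GROUP BY pop_bucket
ORDER BY pop_bucket;

-- Split on the threshold the histogram suggested (about 8,000 here).
CREATE TABLE zips_big   AS SELECT * FROM zips WHERE population >= 8000;
CREATE TABLE zips_small AS SELECT * FROM zips WHERE population <  8000;
```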

Next, I found the nearest big zip for each small zip. In SQL this is a cross join between the big and small zip sets, computing the distance between each pair, then picking min(distance) for each small zip. With roughly 10,000 big zips and 30,000 small zips, the cross product has about 300 million rows.
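The join itself is short to write, even if it’s big to run (table names carried over from the sketches above):

```sql
-- The ~300 million row cross product: every small zip paired with every big
-- zip, using the same unit-free squared distance as before.
CREATE TABLE small_big_dist AS
SELECT s.zip AS zip_small,
       b.zip AS zip_big,
       (s.lat - b.lat) * (s.lat - b.lat) +
       (s.lon - b.lon) * (s.lon - b.lon) AS dist
FROM zips_small s
CROSS JOIN zips_big b;
```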

Now for a brief aside on Spark-SQL versus Hive. In most of my work, Spark-SQL is about 10x faster than Hive. Yet when I ran this join, Spark-SQL was going to take 18+ hours! I copied the data to Hadoop and ran it in Hive, and it took 30 minutes. So it appears Spark-SQL is still new enough that it has some work to do optimizing certain kinds of joins.

Now, SQL makes it hard to select which big zip the minimum distance came from. That is, you can select zip_small, min(distance) and group by zip_small, but that doesn’t tell you which big zip the minimum came from. Once you know the minimum distance, you can join back to the cross-product table on the small zip and distance, but the match won’t necessarily be unique: each small zip is paired with about 10,000 big zips, and it might be exactly the same distance from several of them. This duplication explodes the join.
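Concretely, the natural two-step version looks something like this against the cross-product table from the previous sketch; the equality join on dist in the second step is where ties multiply rows:

```sql
-- Step 1: minimum distance per small zip (this loses the big zip's identity).
-- Step 2: join back on distance to recover the big zip; exact ties on dist
-- return more than one row for the same small zip.
SELECT d.zip_small, d.zip_big, d.dist
FROM small_big_dist d
JOIN (
  SELECT zip_small, MIN(dist) AS min_dist
  FROM small_big_dist
  GROUP BY zip_small
) m
ON d.zip_small = m.zip_small
AND d.dist = m.min_dist;
```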

The textbook workaround would be to rank the big zips by distance within each small zip and select the top rank, but that would be complex. Instead, I thought of a simple hack: force all the distances to be unique, so you can select the small zip and min distance from the cross-product table without any duplicates exploding your join.

To make the distances unique, sprinkle some random bits into them, and do it in the least significant bits so it doesn’t affect which is closest. Put differently, salt the distances with just enough salt to ensure they’re unique, and push the salt down into the least significant bits so it doesn’t affect comparisons.

Doing this is simple. First, run a query to find how close together the closest distances are. In my case it was about 0.1 (a unitless measure). Then generate a random float between 0 and 1, multiply it by one tenth of that figure to be safe, and add it to each computed distance. Because computers generate random numbers in long, effectively non-repeating sequences, a duplicate is statistically improbable, so every distance will be unique, and the salt is scaled small enough (in my case by multiplying by 0.01) that it won’t change each zip’s closest neighbor.
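Here is a sketch of the whole salting step against the placeholder tables above. rand() is Hive’s built-in random-double function, and 0.01 is the scale described above (one tenth of the ~0.1 minimum gap, then another factor of ten for safety); the table names are again illustrative.

```sql
-- One-off check: how close together do distinct distances get?
-- (Expensive on the full cross product, but it only has to run once;
-- here the answer was about 0.1.)
SELECT MIN(gap) AS min_gap
FROM (
  SELECT dist - LAG(dist) OVER (ORDER BY dist) AS gap
  FROM small_big_dist
) g
WHERE gap > 0;

-- Rebuild the distance table with a tiny random salt in the low-order digits.
-- rand() returns a double in [0, 1); scaled by 0.01 it stays far below the
-- ~0.1 minimum gap, so it breaks ties without changing which big zip is closest.
CREATE TABLE small_big_dist_salted AS
SELECT s.zip AS zip_small,
       b.zip AS zip_big,
       (s.lat - b.lat) * (s.lat - b.lat) +
       (s.lon - b.lon) * (s.lon - b.lon) + rand() * 0.01 AS dist
FROM zips_small s
CROSS JOIN zips_big b;

-- Now the min-distance join-back returns exactly one big zip per small zip.
CREATE TABLE small_to_big AS
SELECT d.zip_small, d.zip_big
FROM small_big_dist_salted d
JOIN (
  SELECT zip_small, MIN(dist) AS min_dist
  FROM small_big_dist_salted
  GROUP BY zip_small
) m
ON d.zip_small = m.zip_small
AND d.dist = m.min_dist;
```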

Next, for each big zip, sum the populations of all the small zips mapped to it, and add that sum to the big zip’s own population. This gives the combined population of the big zip and its merged neighbors.
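In sketch form, still using the hypothetical table names from above:

```sql
-- Combined population: each big zip plus all the small zips mapped to it.
CREATE TABLE big_zip_pop AS
SELECT b.zip AS zip_big,
       b.population + COALESCE(SUM(s.population), 0) AS pop_sum
FROM zips_big b
LEFT OUTER JOIN small_to_big m ON m.zip_big = b.zip
LEFT OUTER JOIN zips_small s  ON s.zip = m.zip_small
GROUP BY b.zip, b.population;
```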

Now we have a table with three columns: zip_small, zip_big, pop_sum. Map each tweet author’s zip to zip_small, then report it as zip_big and divide by pop_sum to get the per-capita value.

Of course, we need to add big zips to this table too, so each big zip maps to itself with the population sum.
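Putting it together, the final lookup table and the per-capita report look something like this, still with the placeholder table names used throughout:

```sql
-- Final lookup table: every zip maps to a big zip and the merged population,
-- with each big zip mapping to itself.
CREATE TABLE zip_map AS
SELECT * FROM (
  SELECT m.zip_small AS zip, m.zip_big, p.pop_sum
  FROM small_to_big m
  JOIN big_zip_pop p ON m.zip_big = p.zip_big
  UNION ALL
  SELECT p.zip_big AS zip, p.zip_big, p.pop_sum
  FROM big_zip_pop p
) u;

-- Per-capita tweet counts, reported at the big-zip level.
SELECT zm.zip_big,
       COUNT(*) / zm.pop_sum AS tweets_per_capita
FROM tweets_with_zip t
JOIN zip_map zm ON t.author_zip = zm.zip
GROUP BY zm.zip_big, zm.pop_sum;
```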

This worked beautifully and smoothed the data.

NOTE: there are many applications where adding a little random noise is helpful. This is just one example, and a simple one.