Category Archives: Tech

Why Ignore Unique Words in Vector Spaces?

Lately I’ve been working on natural language processing, learning as I go. My first project is to discover topics being discussed in a set of documents and group the docs by these topics. So I studied document similarity and discovered a Python library called GenSim which builds on top of numpy and scipy.

The tutorial starts by mapping documents to points in a vector space. Before we map we do some basic text processing to reduce “noise” – for example stripping stop words. The tutorial casually mentions removing all words that appear only once in the corpus. It’s not obvious to me why one would do this, and the tutorial has no explanation. I did a bunch of googling and found this is commonly done, but could not find any explanation why. Then I thought it about a little more and I think I know why.

We map words into a vector space having one dimension per distinct word in the corpus. A document’s value or position along each dimension (word) is computed. It could be the simple number of times that word appears in that doc – this is the bag of words approach. It could be the TF-IDF for that word in that document, which is more complex to compute but could provide better results, depending on what you are doing. However you define it, you end up with a vector for each document.

Once you have this, you can do lots of cool stuff. But it’s this intuitive understanding of what the vector space actually is, that makes it clear why we would remove or ignore all words that appear only once in the corpus.

One way to compute document similarity is to measure the angle between the vectors representing documents. Basically, are they pointing in the same direction? The smaller the angle between them, the more similar they are. This is where cosine similarity comes from. If the angle is 0, cosine is 1: they point in the exact same direction. If the angle is 180, cosine is -1: they point in opposite directions. Cosine of 0 means they are orthogonal. The closer to 1 the cosine is, the more similar they are.

Of course, no matter how many dimensions the vector space has, the angle between any 2 vectors lies in a 2-D plane – it can be expressed as a single number.

Let’s take a 3-D example: we have 3 words: brown (X axis), swim (Y axis), monkey (Z axis). Across our corpus of many documents, suppose only 1 doc (D1) has the word monkey. The other words each appear in several documents. That means the vectors for every document except D1 lie entirely in the X-Y plane – their Z component is 0. D1 is the only document whose vector sticks out from the X-Y plane.

Now it becomes easy to see why the word monkey does not contribute to similarity. Take any 2 vectors in this example. If both are in the X-Y plane then it’s obvious that the Z axis has no impact on the angle between them. If only one is in the X-Y plane (call it Dx), it means the other (not in the X-Y plane) must be D1. Here, the angle between D1 and Dx is different from the angle between Dx and the projection or shadow of D1 onto the X-Y plane. But, it doesn’t matter because this is true when comparing D1 to every other vector in the set. The relative differences between D1 and each other vector in the set are the same whether we use D1 or the projection of D1 onto the X-Y plane. In other words, using cosine similarity they still rank in the same order nearest to furthest.

Another way to see this is to consider the vector dot product between D1 and D2. As a reminder, the dot product is the sum of the products of each vector’s components in each dimension. Any dimension that has a value of 0 in either vector contributes nothing to the dot product. And of course, every vector except D1 has a 0 for the Z dimension, so the Z component of the dot product will always be 0. The cosine of the angle between any 2 vectors Dj and Dk is equal to their dot product divided by the product of their magnitudes. If we normalize all vectors to unit length, the denominator is always 1 and cosine is the dot product.

Because of this, any word that appears exactly once in the corpus can be ignored. It has no effect on the similarity of documents. But we can actually make a stronger statement: any word that appears in a single document can be ignored, no matter how many times it appears in that document. This is a misleading – yet not incorrect – part of the tutorial. It removes words that appear only once in the corpus. It could go further and remove words that appear in only 1 document, even if they occur multiple times.

I haven’t found any explanation for this at all, so I can’t confirm my explanation. But I suspect this is why once-occurring words are often ignored. In fact, sometimes people get better results by ignoring words that occur in more than 1 document, if they occur only in a very small number of docs. The reasoning seems to be that words appearing in a handful of docs from a corpus of thousands or millions, have negligible impact on similarity measures. And each word you ignore reduces the dimensionality of the computations.

Ubuntu Linux and Blu-Ray

Getting Linux to work with Blu-Ray took some custom configuration. The state of Linux and Blu-Ray has much to be desired and doesn’t work out of the box. But it can be made to work if you know what to do. Here’s how I got it to work.

Reading Blu-Rays

This was the easy part. You can do it in 2 ways: VLC and MakeMKV

Blu-Rays don’t play in VLC because of DRM. To play them in VLC you need to download a file of Blu-Ray keys, like here: http://vlc-bluray.whoknowsmy.name/. This may not be the best approach because the file is static. New Blu-Rays are coming out all the time. But it works if you regularly update this file and it has the key for the Blu-Ray you want to play.

MakeMKV is software that reads the data from a Blu-Ray and can write it to your hard drive as an MKV file. It can also stream the Blu-Ray to a port on your local machine. Then you can connect VLC to play the stream from that port. Viola! You can watch the Blu-Ray on your computer with VLC, even if you don’t have the keys file. MakeMKV is shareware – free for the first 30 days, then you should pay for it.

Writing Blu-Rays

The first challenge writing Blu-Rays is Ubuntu’s built-in CD writing software, cdrecord. It’s a very old buggy version. This happens even with the latest repos on Ubuntu 15.10. It works fine for Audio CDs, data CDs and DVDs. But not for Blu-Ray. The first step is to replace it with a newer, up-to-date version. The one I used is CDRTools from Brandon Snider: https://launchpad.net/~brandonsnider/+archive/ubuntu/cdrtools.

Whatever front end you use to burn disks (like K3B) works just the same as before, since it uses the apps from the underlying OS, which you’ve now replaced. After this change I could reliably burn dual-layer (50 GB) Blu-Rays on my Dell / Ubuntu 15.10 desktop using K3B. My burner is an LG WH16NS40. It is the bare OEM version and works flawlessly out of the box.

Now you can burn a Blu-Ray, but before you do that you need to format the video & audio and organize into files & directories that a Blu-Ray player will recognize as a Blu-Ray disc. What I’m about to describe works with my audio system Blu-Ray player, an Oppo BDP-83.

The command-line app tsmuxer does this. But it’s a general transcoder that can do more than Blu-Ray, and the command line args to do Blu-Rays are complex. So I recommend also installing a GUI wrapper for it like tsmuxergui.

sudo apt-get install tsmuxer tsmuxergui

Now follow a simple guide to run this app to create the file format & directory structure you need for a Blu-Ray. Here’s the guide I used. Do not select ISO for file output. When I did that, K3B didn’t know what to do with the ISO – my first burn was successful, but all it did was store the ISO file on the disk. Instead select Blu-ray folder. This will create the files & folders that will become the Blu-Ray. Also, you might want to set chapters on the tsmuxer Blu-ray tab. For one big file that doesn’t have chapters, I just set every 10 mins and it works.

When tsmuxer is done, run K3B to burn the files & folders to the blank Blu-Ray. Key settings:

In K3B:
Project type: data
The root directory should contain the folders BDMV and CERTIFICATE
Select cdrecord as the writing app
Select Very large files (UDF) as the file system
Select Discard all symlinks
Select No multisession

Then let ‘er rip. Mine burns at about 7-8x, roughly 35 MB / sec. When it’s done, pop the Blu-Ray into your player and grab some popcorn!

Openconnect VPN on Ubuntu 15.10

I upgraded my Thinkpad Carbon X1 to Ubuntu 15.10 when Linux kernel version 4 became reliable. I had to do this to get power management features working and get good battery life (> 6 hours). Since that upgrade, I have not been able to VPN connect to Disney’s servers from outside their network. Yesterday I finally got this working.

In short, it’s a bug in the vpn scripts that work with Openconnect on Ubuntu 15.10. They successfully connect, but they don’t set a default network route to the VPN network device. To do this, you need to enter a simple Linux route command after VPN connects. See here for details.

In short: after VPN connecting at the command line, look for this message on the console:
Connected tun0 as A.B.C.D, using SSL

Then enter this command:
sudo route add default gw A.B.C.D tun0

NOTE: this is fixed in Ubuntu 16 – no need to enter the route command anymore. It works right out of the box like it used to.

NOTE: after disconnecting VPN, it can leave the VPN DNS servers in the list. This slows down network since every DNS name won’t resolve until these IP addresses time out. To fix that and remove these VPN DNS servers from the list, run this script:

#!/usr/bin/env bash
sudo rm /run/resolvconf/interface/tun0
sudo resolvconf -u
cat /etc/resolv.conf

Dual Monitors in a VirtualBox VM? Yes!

My computers run Linux and that’s where I do most of my work. But I frequently need to use Windows for things like running Outlook to get Exchange email, and running PowerPoint. LibreOffice Writer & Calc are great and compare favorably with Office 2013 Word and Excel, but LibreOffice Present doesn’t hold a candle to PowerPoint.

I run Windows 7 in VirtualBox. It is flawless, but I was never able to get dual monitor to work. This is useful for presenting PowerPoint decks because dual screens let you use the presenter view and see your notes while presenting. I finally got it working – here’s how.

Make sure Virtualbox Guest Extensions are installed in your Windows VM. Shut down the VM client and open Virtualbox VM settings. Go to the Display section. Set Monitor Count to 2. If Virtualbox warns you about Video Memory, increase it as needed.

Start the VM, after Windows boots, log in. Go to Control Panel, Display, Settings. It should show 2 monitors (1) and (2), with (2) greyed out. Click Extend my Windows desktop onto this monitor, then click Apply. A new window pops up on your Ubuntu desktop. This is the Windows VM 2nd monitor (remember it’s virtual, so it’s just another window). Drag it anywhere you want. If you have 2 physical screens running on your Ubuntu desktop, turn off mirroring and drag this 2nd Windows screen to that screen.

Here’s an article with details.

NOTE: I tried this a few months ago and it didn’t work – it crashed the Windows VM. It looks like newer versions of the Linux kernel and video drivers are working better now. I’m currently running Ubuntu 15.10, kernel 4.2.0-25, VirtualBox 5.0.14, on a Thinkpad Carbon X1.

When Themes Go Bad

One of the things I like about Android is it’s an open system. I run Cyanogenmod on all my devices – phones & tablets. This eliminates manufacturer bloat, carrier bloat, gives me a clean pure Android experience with higher performance and better battery life, one-click rooting, full control over the device, and enables me to keep using them long after the manufacturer abandoned them to the planned obsolescence of “no more updates”.

I use my Galaxy Tab 3 8.0 as an electronic flight bag when flying. It’s an old tablet but it’s the perfect size for my kneeboard, has good battery life with enough performance to run my aviation software with GPS moving maps and other features. Recently, a developer on XDA-Developers ported CM13 to this device – and also ported TWRP too! I couldn’t resist. Worst case, if it was buggy, I could always revert to the last OEM supported version (Android 4.4).

Long story short, it worked great. It’s an unofficial build but works as well as any OEM build. And CM13 has nice extras like better battery life, better app permissions control and more customizability. Regarding the last point, it has Themes. When I’m in the mood for a dark theme (which also saves battery live on AMOLED screens), my favorites are Ash and Blacked Out.

Today I decided to try a new dark theme: Deep Darkness. When I turned it on it killed my tablet. That is, the home screen was still there but a system model dialog “System UI has stopped, (R)eport OK”. popped up every 5 seconds, and the tablet was unusable. I thought I would have to boot to recovery and wipe the tablet, but fixed it without that drastic step.

In summary, use the command prompt to wipe the internal Themes directories and remove the offending Theme. Here’s what I did:

First, get the full package name of the theme you installed. In my case, it was com.blissroms.moelle.ddoverhauled-1. Google is your friend here.

Now USB connect the tablet to a computer with ADB. Fortunately, I had the tablet defaulting to debugging mode and my computer was already authorized – because without a functioning system UI I wouldn’t have been able to switch it. Open a command prompt on your computer and use ADB:

adb devices

Checks to ensure it’s recognized. If not, you’re out of luck because without a working UI on your device, you won’t be able to turn dev mode on or authorize the computer it’s connected to. However, if you have a recovery like TWRP that has a file manager, you can boot to recovery and use that. If adb works, you have 2 ways to fix it.

adb shell

This worked, giving me a command prompt on the device. Even though the UI was dead, the OS was still running. Now you must find the app. It could be in /system/app, or /data/app. If you installed it recently, try ll -tr to get a time-sorted listing, newest last.

Now remove this app. First, try the clean way from your computer’s command line (not from the ADB shell on the device):

adb uninstall com.blissroms.moelle.ddoverhauled-1

But this didn’t work for me, said “Failure [DELETE_FAILED_INTERNAL_ERROR]”. Most likely because the Theme was being used. You don’t need to uninstall – you can simply nuke the app & its directory. You can do this from adb shell with rm, or you can boot to recovery, use the TWRP file manager feature. You will first have to tell TWRP to mount the system and data partitions. And you cannot mount them read-only, because you’ll be deleting files from them. From your ADB shell:

rm -fr /data/app/com.blissroms.moelle.ddoverrhauled-1

Next, remove the system themes directory – just nuke it. When it’s not there, CM13 will deal with it on boot, reverting to the stock CM13 theme. If you have to make the /system partition writeable, try this as root:

mount -o rw,remount /system

Again, you can do this from TWRP recovery or from your ADB shell:

rm -fr /data/system/theme

Next, reboot the device and you’re fixed, back to the system default theme.

Ubuntu 15.10 and Thinkpad Carbon X1 – Update

A couple of months ago the version 4 Linux kernel finally got stable and I started using it – great! Except for the current latest version, 4.2.0-27, which had a regression, giving filesystem write errors. So I’m running 4.2.0-25. But this is no big deal, since they’re all variants of 4.2.0, TLP power management works on all of them with the same version of linux-tools-common.

Another benefit: back when I first got this laptop in Aug, I had to change the video mode from SNA to UXA. Without this change, it had black or garbled screens on wake from suspend. SNA is faster than UXA, though it’s not a big difference. I tried switching back to SNA and it works now. So, no need to edit /usr/share/X11/xorg.conf.d/20-intel.conf to set this mode anymore.

Overall, the laptop is running great: all features working, fast and reliable. Ubuntu 15.10 and the kernel have stabilized to the point where it’s boring and highly productive. It just works.

The only problem I’ve encountered is that Gephi crashes whenever it tries to display a graph. This happens both in UXA and SNA. It starts up fine, so this appears to be a problem with the way Gephi uses OpenGL. Other OpenGL apps work fine, so it’s not necessarily a driver problem. And Gephi works fine on my desktop, which is also running Ubuntu 15.10.

Android 6 and SD Cards

Android 6 “Marshmallow” added a new feature: format SD cards as internal storage.

Prior versions of Android always formatted SD cards in VFAT, which doesn’t support filesystem permissions and limits how the SD card can be used. You can store data – files, music, videos, etc. –  but many applications can’t be installed there due to the lack of filesystem permissions.

Android 6 can format the SD card as ext4, the same filesystem used for internal storage. This makes the SD card just like internal storage, with no restrictions. This seems like a good thing, but in practice it has limitations that make it unusable. NOTE: I tried this in Cyanogenmod 13; it might work better (or worse!) on other versions of Android 6.

I assume you already know the basic facts. If not, just Google it or read Ars Technica. Here I’ll go into some details. When Android 6 detects an SD card it lets you choose how to format it: internal (ext4), or portable (VFAT).

When I selected internal, the SD card was formatted ext4 and mounted at /mnt/expand/a474aa54-1e0a-4df6-9bc1-1e202a5167fa. This looks like a GUID generated from the SD card, will probably be different for every card.

When I selected portable, the SD card was formatted VFAT and mounted at /storage/4E3F-1CFD. This too appears to be a random ID attached to the card.

I really wanted to use internal, but several reasons prevented me:

When I first turned on internal storage, it seemed to work perfectly. Several apps moved there and everything worked fine. Then I rebooted and discovered a bug – several app icons missing from my home screen, and a system modal dialog popped up saying “System is not responding, (wait) or (close)”. I selected “wait”, everything was running fine, but the apps were simply missing from the home screen. They were still there, installed on the device, working fine, nothing lost. I dragged them back to the home screen and everything worked – until the next reboot.

Turns out, the missing apps were those stored on the SD card. The apps are still there and still work, no data is lost. But every time the device is rebooted they disappear from the home screen and you get this system error message. My hypothesis is that the SD card is being mounted too late in the boot process, so when the home screen opens and the home screen launcher (for CM13, Trebuchet) needs the icons, the apps aren’t yet available because the storage mount is still “in progress”. Just a guess… of course I can’t be sure. IF this is the root cause, then the boot process should mount the SD card as early as possible and block, waiting for SD card mount to fully complete, before proceeding. This is a serious annoying bug in Android 6 or CM 13. This alone would make the internal SD option unusable, but it got worse.

You don’t get to control what is stored on the card vs. internal. Apps themselves decide, based on the app’s code or settings in the app’s Android manifest file. Problem is, internal storage is new and most app developers haven’t thought about it much. So the behaviors you get don’t always make sense.

The camera app (neither the CM13 built-in, nor Google AOSP) didn’t store its photos or movies on external storage. Yet this seems to be one of the most obvious uses for external storage. And the app’s “storage location” setting gave me an error message when I selected external storage, refusing to take any pictures. I was forced to use internal storage for photos and movies.

Some apps install to external storage, but keep their data on internal. This is the exact opposite of how SD storage should be used! For example, Sygic, whose data can be a GB or more depending on how many states you download. If there ever was a use for this feature, this is it. Yet it didn’t work. Sygic lived on external storage but stored all its data on internal.

Some apps simply can’t find the SD card when it’s formatted as internal. For example, Solid Explorer. It could find the SD card only if you’re rooted (and I was, in CM 13) and you manually navigate to the above path where it’s mounted.

When I did get apps to put their data on the SD card, I ran into endless permissions problems. Android 6 by default assigns strict permissions to all the dirs & files created by an app. In my opinion, this has always been a solution in search of a problem, creating more hassles than it solves. I found myself constantly navigating to the various dirs and files and changing their owner, group or marking permissions to 666 or 777. This quickly gets tiring, then annoying, then infuriating.

Some apps never were able to write to the SD, even after setting up permissions. Office Suite Pro was one, though there may be others.

In summary, I gave up on internal SD storage – but I still “believe” this is the right way go, hope the Google Android team fixes the above problems. However, switching back to portable storage wasn’t easy sailing either.

Before switching back to portable (each switch re-formats the SD card), I used the Android 6 storage settings to individually move each app back to internal storage. Most of them worked after being moved, but I had to uninstall & reinstall a couple of them. Then Android 6 reformatted and remounted the card.

The home screen launcher bug never occurred – yay!
External storage was recognized by the camera app and the file manager – yay!
Apps that support external storage in Android 4 and 5 worked the same way as before – yay!

But, some apps (like Office Suite Pro) still could not read or write the SD card. On Android 4 and 5 you can fix this by editing /etc/permissions/platform.xml (more detail here). On Android 6, I went to edit that same file and found the WRITE_EXTERNAL_STORAGE  permission wasn’t even there! I added it and READ_EXTERNAL_STORAGE to the Android 6 file, then rebooted.

Office Suite Pro could write to the SD card – yay!

There was only 1 app I couldn’t get to use the SD card in Android 6, no matter how I configured it (internal or portable) – that was Sygic. The new way didn’t work – Sygic put itself on external but stored all its data on internal. The old way didn’t work either. Sygic detected that its files moved to the SD, rearranged them, then hangs on its white startup screen, eventually crashes. Rebooting etc. doesn’t fix this.

In summary, Android 6 made a good sporting effort to improve SD card storage. This is a great idea made unusable by poor implementation. If the Android supported filesystem links I could fix all of the problems myself (e.g. ln -s /mnt/expand/a173822/Download /storage/emulated/0/Download). But I haven’t found a way to do that – at the command line, this fails even when rooted.

Optimizing the Penalty

I’ve been optimizing a Naive Bayes classifier and one part of it was tuning the penalty ratio. First, some background on Naive Bayes.

When training Naive Bayes, you count how many times each feature occurs in each class. At the end of training you turn each count into a ratio. I call the table containing the ratio for each feature for each class the FDM for Feature Density Matrix. For example, suppose you are classifying email as Spam or Ham and the features are words in email body.  Each word has a ratio representing how often it appeared in that class. For example the word viagra might have a high ratio in Spam and low (or never appear) in Ham.

To classify with Naive Bayes, you take each feature from the item (in this case words from an email), get the ratio for each, multiply them all together to get a single number. Do this for each class (in this case twice, once each for Spam and Ham), then assume the item belongs to whichever class produced the highest number.

NOTE: this is where the name Naive Bayes comes from. Bayes, because you’re multiplying frequencies (probabilities) together, as if the probability of them all happening together equaled the product of their individual probabilities. Naive, because this would be true only if each was independent or uncorrelated to the others. They’re not, so this (naive assumption) is wrong, but it still works surprisingly well. An incorrect model is often a useful tool.

So what happens if the item you’re classifying has a feature that is not in the FDM? It could be a new word that never appeared during training, or you saw it in training but not for this particular class? You can’t use 0 because that would nullify the entire number for this class (remember, you’re multiplying them all together). You can’t use 1 because that won’t penalize the ratio for this feature that was never seen for this class.

Here’s one solution: use a small constant ratio K for words that never appeared in the class. During training, while building the FDM, keep a running track of the smallest ratio, then at the end use it to set K. The question is: how much smaller than the minimum should K be? There is no easy answer. It depends on how much you want to penalize a class that doesn’t have a feature from the item you’re classifying. Whether you over-penalize it (K too small) or under-penalize it (K too big), depends on the actual results. Optimal classification accuracy may come from some in between value of K.

NOTE: why am I assuming K should be smaller than the smallest ratio in the FDM? Because K is used for features that never appeared for a certain class. If that feature appeared only once for this class, we would have some other ratio R. Appearing never is worse than appearing once, so K should be smaller than R. We don’t know exactly what R would be, but we do know that the smallest ratio in the table is for some feature that occurred at least once, so K should be less than that.

But then I thought, why does K matter at all? There is a kind of symmetry to this. For each class, you’re multiplying a bunch of ratios together. If a certain feature is not in the FDM, then you’ll have the same K in each of the overall class ratios. Thus K will cancel out – that is, if the same K is in all of them, it can’t change which is biggest.

As it commonly happens, that random thought turned out to be half-baked. Yes, it’s true – but it’s true only if the feature is missing from the FDM. That is, entirely missing – no class had this feature. That is the symmetric case. But what often happens is a particular feature exists for some classes yet not others. Like the viagra example above – it might appear in the FDM, but only for Spam, not for Ham. This is asymmetric. So yes, the value of K really does indeed matter, because other classes might have contained the feature that this class did not.

Let’s take 2 cases to see how K can be both “too big” and “too small”.

First, what if K is too big? This is the obvious case. Consider an email having the word viagra. For Spam, the ratio for viagra is pretty high. For Ham, we need that ratio to be low. But if viagra never appeared in Ham, we rely on K. If K is too big, it won’t penalize the Ham class enough and the overall ratio from Ham might be big, causing us to classify the email as Ham.

Second, what if K is too small? Suppose there is a relatively “neutral” feature, say a relatively rare word like zugzwang, which appears in some classes but not others, and its ratios are about the same in the classes where it does appear. This means the word is not useful for classifying items (because the ratios across classes are similar). Yet the classes where it didn’t occur get unfairly penalized because K is so small.

In summary, K should always be less than the smallest ratio in the FDM. But you need to test different values of K to see how much smaller to set it. I find values in the range 10^-3 to 10^-10 work well. That is, K is 1/1000 or less of the the smallest ratio in the FDM. Your mileage may vary!

Take that Data with a Grain of Salt

I wanted to geographically visualize Twitter activity by zip code. I had a set of Tweets captured from GNIP Powertrack. Each tweet has the author’s public profile, as well as the tweet text and several other fields.

First, I got the zip code data: about 42,500 zip codes each with lat/lon coordinates and population estimate from US 2010 Census. Next, I loaded it into a Hive table. I used Hive because the tweet data was big enough to live in Hadoop, though you could do the geo zip code manipulation I’m describing in any database.

Next, from the Twitter data I found the closest zip code to each author by computing the distance from the author’s lat/lon to the lat/lon of every zip code. Of course, this only works for people who put their location in their Twitter public profile, but that was surprisingly many people. I only needed the closest, which simplified the computation since units didn’t matter.

Next, I computed metrics (like tweet counts) per zip code. Raw numbers would not be useful – of course most of the tweets come from the zips with the most people. I divided by zip code population to get per-capita values.

At this point I should have been done, in which case it would be so simple and quick there would be no reason to write about it here. However, that’s not what happened. These values were all over the map: spanning 5 orders of magnitude (0 – 10^5). Per capita?! This didn’t make sense. I expected values of zero or close to zero, but reasonable maximums would be a fraction of 1 – nowhere near the tens of thousands.

Investigation revealed the outliers were tiny zip codes like 30334 covering about 1 square block having tiny populations like 1. Whenever a prolific tweeter (or tweet-bot) resides in one of these it blows out the data. Clearly, these are zip codes too new to have accurate 2010 Census data.

The problem is actually more general: any small zip code can blow out the data – even if its population data is accurate. I needed a way to factor out these tiny zip codes. One way to do this is to merge each small zip code into its nearest big zip code. Split zip codes into 2 groups: small and big, using a threshold population. I made a histogram of zip codes by population to see what would be a good split value. I chose 25-75 since 25% is about 10,000 zip codes, which is enough for smooth dense coverage. The histogram showed a population of about 8,000 would give this 25-75 split.

Next, I found the nearest big zip for each small zip. In SQL this is a full outer join between the big and small zip sets, computing distance between each, then picking min(distance) for each small zip. With roughly 10,000 big zips and 30,000 small zips, the cross product would have about 300 MM rows.

Now for a brief aside on Spark-SQL versus Hive. In most of my work, Spark-SQL is about 10x faster than Hive. Yet when I ran this join, Spark-SQL was going to take 18+ hours! I copied the data to Hadoop and ran it in Hive, and it took 30 minutes. So it appears Spark-SQL is still new enough they have some work to do optimizing certain kinds of joins.

Now, SQL makes it hard to select which big zip the minimum distance came from. That is, you can select zip_small, min(distance) and group by zip_small. But it doesn’t tell you which big zip that minimum came from. Once you know that minimum distance, you can select the small zip and distance from the above cross-product table, but this won’t necessarily be unique. Each small zip has about 10,000 big zips, and it might be the exact same distance from several.
this duplication explodes the join.

The textbook workaround would be to rank and select by rank within distance, but that would be complex. Instead, I thought of a simple hack: force all the distances to be unique, so you can select the small zip and min distance from the cross-product table without any duplicates exploding your join.

To make the distances unique, sprinkle some random bits into them, and do it in the least significant bits so it doesn’t affect which is closest. Put differently, salt the distances with just enough salt to ensure they’re unique, and push the salt down into the least significant bits so it doesn’t affect comparisons.

Doing this is simple. First, query to find how close together are the closest distances. In my case it was about 0.1 (unitless measure). Then, generate a random float between 0 and 1, multiply it by 1/10 of that figure to be safe, and add it to each computed distance. Because computers generate random numbers in sequences, getting a duplicate is statistically improbable, so every distance will be unique, and you scaled it small enough (in my case by multiplying by 0.01) it won’t change each zip’s closest neighbor.

Next, for each big zip, sum the populations of all its small neighbors and add this sum to the big zip’s population to get the population of the big zip with all its neighboring small zips.

Now we have a table that has: zip_small, zip_big, pop_sum. Map each tweet’s author’s zip to zip_small, then report it as zip_big and divide by zip_sum to get per capita.

Of course, we need to add big zips to this table too, so each big zip maps to itself with the population sum.

This worked beautifully and smoothed the data.

NOTE: there are many applications where adding a little random noise is helpful. This is just one example, and a simple one.