Monthly Archives: March 2016

Most Common GA Accident Types

In the past 100 years nobody has created a new way to destroy an airplane. Here are the most common ways, roughly ordered from most to least common:

  1. Weather: pilot didn’t respect Mother Nature (she doesn’t have to respect you – she was here first).
  2. Fuel: pilot ran out of fuel (airplane engines run better with fuel).
  3. Planning: or rather the lack thereof – over gross weight, out of CG limits, ignoring density altitude, VFR into IMC, etc.
  4. Maintenance: pilot departed with known aircraft deficiency (airplanes work best when properly maintained).
  5. Impairment: pilot was cognitively impaired (fatigue, drugs, etc.).
  6. Stupidity: pilot intentionally did something stupid (buzzing, “watch this”, etc.).

Every aviation accident I know of falls into at least one of these categories – sometimes more than one. The good news is, improving safety is simple common sense. Don’t do these things! Safety improves one pilot at a time. If you don’t do these things, you’ve improved your safety roughly 10-fold and you’re making GA safer than driving or bicycling.

Why Ignore Unique Words in Vector Spaces?

Lately I’ve been working on natural language processing, learning as I go. My first project is to discover the topics being discussed in a set of documents and group the docs by those topics. So I studied document similarity and discovered gensim, a Python library that builds on top of numpy and scipy.

The tutorial starts by mapping documents to points in a vector space. Before we map, we do some basic text processing to reduce “noise” – for example, stripping stop words. The tutorial casually mentions removing all words that appear only once in the corpus. It’s not obvious to me why one would do this, and the tutorial offers no explanation. I did a bunch of googling and found this is commonly done, but could not find any explanation why. Then I thought about it a little more, and I think I know why.
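Concretely, the preprocessing step looks roughly like this – a sketch in the spirit of the gensim tutorial, with a toy corpus and stop list of my own:

from collections import defaultdict

# Toy corpus and stop list - stand-ins for real documents.
documents = [
    "the quick brown fox",
    "a monkey can swim",
    "the brown bear can swim",
]
stoplist = set("for a of the and to in can".split())

# Lowercase, tokenize, and strip stop words.
texts = [[word for word in doc.lower().split() if word not in stoplist]
         for doc in documents]

# Count each word's occurrences across the whole corpus...
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

# ...and drop the words that appear only once.
texts = [[token for token in text if frequency[token] > 1] for text in texts]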

We map words into a vector space having one dimension per distinct word in the corpus. Each document gets a value, or position, along each dimension (word). It could be the simple number of times that word appears in that doc – the bag-of-words approach. It could be the TF-IDF for that word in that document, which is more complex to compute but can give better results, depending on what you are doing. However you define it, you end up with a vector for each document.
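Continuing the sketch above, in gensim both representations take just a few lines:

from gensim import corpora, models

# One dimension per distinct word; each document becomes a sparse
# (word_id, count) vector - the bag-of-words representation.
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# The same vectors re-weighted by TF-IDF.
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]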

Once you have this, you can do lots of cool stuff. But it’s this intuitive understanding of what the vector space actually is that makes it clear why we would remove or ignore all words that appear only once in the corpus.

One way to compute document similarity is to measure the angle between the vectors representing the documents. Basically, are they pointing in the same direction? The smaller the angle between them, the more similar they are. This is where cosine similarity comes from. If the angle is 0°, cosine is 1: they point in the exact same direction. If the angle is 180°, cosine is -1: they point in opposite directions. If the angle is 90°, cosine is 0: they are orthogonal. The closer the cosine is to 1, the more similar they are.
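Here’s a quick numpy sketch, with made-up vectors, just to show the arithmetic:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between vectors a and b.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

d1 = np.array([1.0, 2.0, 0.0])
print(cosine_similarity(d1, np.array([2.0, 4.0, 0.0])))    # 1.0: same direction
print(cosine_similarity(d1, np.array([-1.0, -2.0, 0.0])))  # -1.0: opposite directions
print(cosine_similarity(d1, np.array([2.0, -1.0, 0.0])))   # 0.0: orthogonal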

Of course, no matter how many dimensions the vector space has, the angle between any 2 vectors lies in a 2-D plane – it can be expressed as a single number.

Let’s take a 3-D example with 3 words: brown (X axis), swim (Y axis), monkey (Z axis). Across our corpus of many documents, suppose only 1 doc (D1) has the word monkey. The other words each appear in several documents. That means the vectors for every document except D1 lie entirely in the X-Y plane – their Z component is 0. D1 is the only document whose vector sticks out of the X-Y plane.

Now it becomes easy to see why the word monkey does not contribute to similarity. Take any 2 vectors in this example. If both are in the X-Y plane, it’s obvious that the Z axis has no impact on the angle between them. If only one is in the X-Y plane (call it Dx), the other must be D1. Here, the angle between D1 and Dx is different from the angle between Dx and the projection, or shadow, of D1 onto the X-Y plane. But that doesn’t matter, because dropping the Z component scales the cosine against every Dx by the same constant factor – the ratio of D1’s length to its projection’s length. In other words, using cosine similarity, the other documents still rank in the same order, nearest to furthest, whether we compare them against D1 or against its projection.
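Here’s a little numpy check of that claim, using made-up counts for the three words:

import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Dimensions: brown (X), swim (Y), monkey (Z).
d1 = np.array([2.0, 1.0, 3.0])        # the only doc containing "monkey"
others = [np.array([1.0, 0.0, 0.0]),
          np.array([1.0, 2.0, 0.0]),
          np.array([0.0, 3.0, 0.0])]

d1_proj = d1.copy()
d1_proj[2] = 0.0                      # project D1 onto the X-Y plane

full = [cosine_similarity(d1, d) for d in others]
proj = [cosine_similarity(d1_proj, d) for d in others]

# The scores differ by a constant factor, so the ranking is identical.
print(np.argsort(full))   # [2 1 0]
print(np.argsort(proj))   # [2 1 0]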

Another way to see this is to consider the vector dot product between D1 and any other document D2. As a reminder, the dot product is the sum of the products of the two vectors’ components in each dimension. Any dimension that has a value of 0 in either vector contributes nothing to the dot product – and every vector except D1 has a 0 in the Z dimension, so the Z component of the dot product is always 0. The cosine of the angle between any 2 vectors Dj and Dk equals their dot product divided by the product of their magnitudes. If we normalize all vectors to unit length, the denominator is always 1 and the cosine is just the dot product.
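A tiny illustration of that last point, again with made-up vectors:

import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

a = np.array([2.0, 1.0, 3.0])
b = np.array([1.0, 2.0, 0.0])

# With unit-length vectors, cosine similarity is just the dot product.
print(np.dot(unit(a), unit(b)))
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))  # same number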

Because of this, any word that appears exactly once in the corpus can be ignored: it has no effect on the similarity of documents. But we can actually make a stronger statement: any word that appears in only a single document can be ignored, no matter how many times it appears in that document. This is where the tutorial is misleading – yet not incorrect. It removes words that appear only once in the corpus, but it could go further and remove words that appear in only 1 document, even if they occur multiple times there.

I haven’t found this explained anywhere, so I can’t confirm my reasoning, but I suspect this is why once-occurring words are so often ignored. In fact, people sometimes get better results by also ignoring words that occur in more than 1 document, as long as they occur in only a very small number of docs. The reasoning seems to be that words appearing in a handful of docs out of a corpus of thousands or millions have negligible impact on similarity measures, and every word you ignore reduces the dimensionality of the computations.
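For what it’s worth, gensim’s Dictionary can express the stronger filter directly – assuming the dictionary from the sketch above:

# no_below drops words that appear in fewer than 2 documents, no matter
# how often they occur within a document; no_above can additionally drop
# words that appear in more than the given fraction of documents.
dictionary.filter_extremes(no_below=2, no_above=1.0)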

Ubuntu Linux and Blu-Ray

Getting Blu-Ray working on Linux took some custom configuration. The state of Blu-Ray support on Linux leaves much to be desired – nothing works out of the box – but it can be made to work if you know what to do. Here’s how I did it.

Reading Blu-Rays

This was the easy part. There are 2 ways to do it: VLC and MakeMKV.

Blu-Rays don’t play in VLC out of the box because of DRM. To play them in VLC you need to download a file of Blu-Ray keys, like the one here: http://vlc-bluray.whoknowsmy.name/. This may not be the best approach because the file is static and new Blu-Rays are coming out all the time, but it works as long as you update the file regularly and it has the key for the Blu-Ray you want to play.
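For reference, VLC’s Blu-Ray decryption goes through libaacs, which on Linux looks for the key database at ~/.config/aacs/KEYDB.cfg. So after downloading (the exact filename on that site may differ), installing the file looks something like:

mkdir -p ~/.config/aacs
cp KEYDB.cfg ~/.config/aacs/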

MakeMKV is software that reads the data from a Blu-Ray and can write it to your hard drive as an MKV file. It can also stream the Blu-Ray to a port on your local machine, and you can then connect VLC to play the stream from that port. Voila! You can watch the Blu-Ray on your computer with VLC, even if you don’t have the keys file. MakeMKV is shareware – free for the first 30 days, after which you should pay for it.

Writing Blu-Rays

The first challenge in writing Blu-Rays is Ubuntu’s built-in CD-writing software, cdrecord: the bundled version is very old and buggy, even with the latest repos on Ubuntu 15.10. It works fine for audio CDs, data CDs and DVDs, but not for Blu-Ray. So the first step is to replace it with a newer, up-to-date version. The one I used is CDRTools from Brandon Snider: https://launchpad.net/~brandonsnider/+archive/ubuntu/cdrtools.
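Assuming the PPA name matches that Launchpad URL (the exact package names may vary), installing it looks something like:

sudo add-apt-repository ppa:brandonsnider/cdrtools
sudo apt-get update
sudo apt-get install cdrecord mkisofs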

Whatever front end you use to burn discs (like K3B) works just the same as before, since it calls the apps from the underlying OS, which you’ve now replaced. After this change I could reliably burn dual-layer (50 GB) Blu-Rays on my Dell / Ubuntu 15.10 desktop using K3B. My burner is an LG WH16NS40 – the bare OEM version – and it works flawlessly out of the box.

Now you can burn a Blu-Ray, but before you do, you need to package the video & audio into the files & directories that a Blu-Ray player will recognize as a Blu-Ray disc. What I’m about to describe works with my audio system’s Blu-Ray player, an Oppo BDP-83.

The command-line app tsmuxer does this. But it’s a general-purpose muxer that can do more than Blu-Ray, and the command-line args for Blu-Ray are complex, so I recommend also installing a GUI wrapper for it, tsmuxergui:

sudo apt-get install tsmuxer tsmuxergui

Now follow a simple guide to run this app and create the file format & directory structure you need for a Blu-Ray. Here’s the guide I used. Do not select ISO for file output: when I did that, K3B didn’t know what to do with the ISO – my first burn was successful, but all it did was store the ISO file on the disc. Instead, select Blu-ray folder, which creates the files & folders that will become the Blu-Ray. You might also want to set chapters on the tsmuxer Blu-ray tab; for one big file that doesn’t have chapters, I just set a chapter every 10 minutes and it works.

When tsmuxer is done, run K3B to burn the files & folders to the blank Blu-Ray. Key settings in K3B:

Project type: data
The root directory should contain the folders BDMV and CERTIFICATE
Select cdrecord as the writing app
Select Very large files (UDF) as the file system
Select Discard all symlinks
Select No multisession

Then let ‘er rip. Mine burns at about 7-8x, roughly 35 MB / sec. When it’s done, pop the Blu-Ray into your player and grab some popcorn!