Blog

How Music ID Apps Work

While it seems like our phones can do just about anything these days, every now and then they produce some small miracle that feels like nothing short of magic. Music ID apps, like Shazam, SoundHound, and many others, often seem inexplicable: they can listen to a short, four-to-five-second clip of a song, or in some cases the user's hummed or sung melody, and correctly identify it in a matter of seconds. For years, people heard music they enjoyed but couldn't identify, and until the invention of music ID software there was no adequate way to solve that problem.

The Beginning

With the advent of the smartphone, potential solutions to common problems multiplied dramatically. No longer cut off from the world, we enjoy a level of connection to one another greater than at any point in history. Information of all sorts is right at our fingertips, including an extensive library of essentially every song recorded in modern history. With all that data out there, the real challenge is deciding how best to search that library for the information you need. Music ID apps like Shazam paved the way for sound recognition software. Instead of forcing a user to search manually through a labyrinth of song data, the app generates a unique fingerprint of the recorded clip and compares it to a database of fingerprints generated the same way for each song in the library. The service first launched in the UK in 2002 and was popularized later when it was brought to the iPhone in the US. Soon after, others came along to get in on the budding market of music recognition.

How It Works

While it may feel like some black magic is fetching your song data in the blink of an eye, the reality is far less grandiose and much easier to understand. Some assume that an evolved form of voice recognition software must be at work; however, that would be impractical, since recognizing a voice is only part of the problem: the software must also identify the song itself, and even the specific version of it, as many songs exist in several versions by the same artist. Instead, the apps use proprietary formulas to translate song data into unique numerical codes. A library of these codes is generated for known songs; when the app is given a sample recording, it creates a fingerprint for that sample and compares it against the library. Essentially, the software does for sound what Google does for words and images.
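To make the fingerprint-and-lookup idea concrete, here is a toy sketch in Python. Everything in it is a simplification for illustration: the "hashes" are bare integers standing in for real acoustic fingerprints, and the function names are invented. The one genuine trick it shows is that a match is counted per (song, time-offset) pair, so a real hit produces many votes at one consistent offset rather than scattered coincidences.

```python
from collections import Counter

def build_library(songs):
    """Map each hash to the (song_id, time_offset) pairs where it occurs."""
    library = {}
    for song_id, hashes in songs.items():
        for t, h in hashes:
            library.setdefault(h, []).append((song_id, t))
    return library

def identify(sample_hashes, library):
    """Vote on (song, offset-difference) pairs; a consistent time offset
    between sample and song suggests a true match, not coincidence."""
    votes = Counter()
    for t_sample, h in sample_hashes:
        for song_id, t_song in library.get(h, []):
            votes[(song_id, t_song - t_sample)] += 1
    if not votes:
        return None
    (song_id, _offset), count = votes.most_common(1)[0]
    return song_id, count

# Toy data: hashes are just integers; times are in seconds.
songs = {
    "song_a": [(0, 11), (1, 42), (2, 99), (3, 17)],
    "song_b": [(0, 42), (1, 55), (2, 11), (3, 60)],
}
library = build_library(songs)
# A short sample taken from song_a starting one second in.
sample = [(0, 42), (1, 99), (2, 17)]
print(identify(sample, library))  # → ('song_a', 3)
```

Note that the hash 42 also appears in song_b, but only song_a accumulates three votes at the same offset, so the spurious collision is outvoted.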

While that sounds simple enough, the complexity of the software is the most common source of problems. For a long time it was considered impractical to boil a song down to a set of digits, as a song simply contains too much information to abridge into an easy-to-use fingerprint. Instead, the software creates a three-dimensional plot of the song, comparing three data points simultaneously: frequency, amplitude, and time. This allows the software to ignore the insignificant portions of the song data and focus only on the high-energy, intense moments. These distinct data points are generated for each song at roughly three per second, well within the range of acceptable code complexity for use as a fingerprint. Avery Wang, the creator of Shazam's algorithm, explains in full detail how his audio search works in his published paper, "An Industrial-Strength Audio Search Algorithm."
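The frequency-amplitude-time plot described above can be sketched in a few lines of NumPy. This is a minimal illustration with assumed parameters (frame size, synthetic input), not any app's actual pipeline: it slices the audio into short frames, takes an FFT of each to get amplitude per frequency over time, and keeps only the highest-energy frequency in each frame, discarding everything insignificant.

```python
import numpy as np

def peak_constellation(signal, sample_rate, frame_size=1024):
    """Return (time_sec, frequency_hz) of the loudest bin in each frame."""
    peaks = []
    n_frames = len(signal) // frame_size
    for i in range(n_frames):
        frame = signal[i * frame_size:(i + 1) * frame_size]
        spectrum = np.abs(np.fft.rfft(frame))   # amplitude per frequency bin
        spectrum[0] = 0.0                       # ignore the DC component
        bin_idx = int(np.argmax(spectrum))      # highest-energy frequency
        freq_hz = bin_idx * sample_rate / frame_size
        time_sec = i * frame_size / sample_rate
        peaks.append((time_sec, freq_hz))
    return peaks

# Synthetic "song": one second of a 440 Hz tone, then one second of 880 Hz.
rate = 8192
t = np.arange(rate) / rate
signal = np.concatenate([np.sin(2 * np.pi * 440 * t),
                         np.sin(2 * np.pi * 880 * t)])
peaks = peak_constellation(signal, rate)
print(peaks[0], peaks[-1])  # → (0.0, 440.0) (1.875, 880.0)
```

The sparse list of peaks, rather than the raw waveform, is what gets hashed into the fingerprint, which is why the method survives background noise: quiet interference rarely displaces the loudest points in the plot.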