Shazam Turns 20


tasty_freeze 1 hour ago | next [–]
I have a relevant story here. The inventor of the Shazam algorithm, Avery Wang, gave me a demo of it within a couple weeks of it being created. Here is the backstory (partly from personal knowledge, partly relayed by Avery).
Avery had gotten his PhD from CCRMA at Stanford under Julius O Smith. His PhD had been on the topic of automatically ("blind") recovery of individual vocal/instrument tracks from a final mix. From there he joined a startup, Chromatic Research, where I was also at. He created lots of code and patented some algorithms related to resampling and MIDI synthesis, stuff like that. Avery was (and is) a super nice guy, humble but incredibly smart -- he could work not only with the high level mathematics but was also equally excited to fine tune assembly code.
After Chromatic folded, Avery had been struggling to get his own startup off the ground. About the same time, the Shazam guys had the idea for the product but didn't know how to create the algorithm. They approached Smith at CCRMA looking for someone capable of creating something that worked. Smith suggested they try Avery Wang.
At first Avery said, "Hmm, that seems difficult, but let me think about it." Within a week he had a demo running on a few thousand songs he had gathered from CDs. I'm sure a lot of refinements went into it after that, but the core idea took him a weekend.
[ All factual mistakes above are due to me and 20 year old memories -- if I misrepresented something it certainly wasn't due to Avery telling me something that wasn't true ]
EDIT: A blurb about Avery https://www.seti.org/avery-wang
bluetidepro 10 hours ago | prev | next [–]
Shazam is one of the very few apps in the past 20 years that STILL "wows" me. I have no idea how the tech works, and I even sort of like not knowing, to be honest. It's one of the very few apps out there that still feels "magical" to me. I am constantly impressed with how fast and easily it works, even with very obscure music. What an amazing app.
Fun quick related story: 10 or more years ago there was a background song on a TV show (Scrubs) that I really liked, and it was only in the Netflix version. It was just an instrumental with some French-sounding spoken words, so there was no easy way to search for it. However, it was distinct enough that it didn't seem like something made just for the show. It was also pretty quiet, playing under dialogue in the scene. I posted on reddit asking if anyone knew it and never got any responses. I searched all over the web, but no source had the track details. It drove me crazy every time I heard the song while re-watching the show, and every few years I'd try again and still couldn't track it down. Back then Shazam hadn't catalogued it, so it wasn't in there either. However, when re-watching the show a few years back, I tried Shazam again and to my surprise it finally worked. I was blown away that Shazam was finally able to solve this 10+ year mystery. It was one of the coolest feelings ever to scratch that itch, finding this rare French song and hearing it in full. It was truly magical.
EDIT: Oh sorry, I didn't think anyone would actually care about the song itself lol. It's called "Sans Hésitation" by the French-Canadian band "Chapeaumelon": https://www.youtube.com/watch?v=Ju4d3YQhByU - It's also interesting because the song now does show up for the episode on TV music database sites. Very cool.
quantumduck 5 hours ago | parent | next [–]
Shazam used to wow me, but as others mentioned in the replies, it's essentially matching the signature of the sound to the sounds in the database. If it's one of the songs, it gets matched fairly quickly.
What blew my mind was when Google introduced 'hum and we'll recognize the song for you' in Google Assistant: https://www.google.com/amp/s/blog.google/products/search/hum...
It works so well even with my shitty humming - even my girlfriend can't recognize what the song is but Google can. It doesn't even have the same signature as the original audio file, just similar hums in a noisy environment and it still works. Black magic fuckery.
tasty_freeze 1 hour ago | root | parent | next [–]
> it's essentially matching the signature of the sound to the sounds in the database.
You aren't giving it enough credit. The algorithm uses just a few seconds from any part of the song, and has to deal with phone audio quality and often background noise. I mean, you can be in a bar with all that jabber and hold up the phone and it could pick out the song. The app on the phone does the preprocessing to the audio before it is sent to the server that does the matching ... using the comparatively miserable power of a 2001 era cell phone.
quantumduck 8 minutes ago | root | parent | next [–]
Oh that wasn't my intention - Shazam was and is groundbreaking, they did it when no one else could. All I meant was that it seems more "doable and I probably understand how it works" when compared to how Google assistant recognizes songs from my humming.
What really wows me is that Shazam started in 2002. It was a phone number you would call on your cell phone and let it listen to your environment.
Way back then, it was doing everything you describe, but over low quality band limited telephone lines.
swores 6 hours ago | root | parent | next [–]
As an almost teenager at the time, that (Shazam over the phone with an answer texted back - which I used on a Nokia 3310) was the one thing that convinced me we would soon have pocket devices that really could do anything.
And while it took a few iterations (for me, from palm pilot to blackberry as a teenager, then eventually moving to iPhone after a few too many painful Blackberry upgrades - still missing that unified inbox though, as is everyone else I know who had a BB of that era... and frankly missing a great physical keyboard on a phone, too) I still am impressed on a daily basis that I do indeed have the device in my pocket that 12 year old me dreamed of.
vlunkr 6 hours ago | root | parent | next [–]
I didn't know it ever worked that way, that's incredible. Reminds me of ChaCha, the texting service where you texted questions and a human would quickly look up the answer and text it back. It's a very cool idea that was quickly outmoded by smart phones and is kind of lost to history now.
jasonwatkinspdx 3 hours ago | parent | prev | next [–]
I don't know about Shazam's current algorithm specifically, but years ago I worked at a place with a mathematician who had worked on Gracenote's algorithms, and I asked him for the basics of how it works.
Basically, it records audio, chopping it up into small segments and running them through an FFT. Then, treating the data like a greyscale spectrogram image, it runs it through a quantization filter that helps reject some noise, then converts that into locality-sensitive hashes that are sent to the server. So basically: FFT, filter, hash, lookup.
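For the curious, the chop-and-FFT step described above can be sketched in a few lines of NumPy. The frame size, hop, and windowing choices here are illustrative assumptions, not Gracenote's or Shazam's actual parameters:

```python
import numpy as np

def spectrogram(samples, frame_size=1024, hop=512):
    """Chop audio into overlapping frames and FFT each one.

    Rows are time frames, columns are frequency-bin magnitudes:
    the greyscale "image" described above."""
    window = np.hanning(frame_size)
    frames = []
    for start in range(0, len(samples) - frame_size, hop):
        frame = samples[start:start + frame_size] * window
        # rfft keeps only the non-negative frequencies of a real signal
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# toy input: a 440 Hz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
spec = spectrogram(np.sin(2 * np.pi * 440 * t))
peak_bin = spec[0].argmax()
print(peak_bin * sr / 1024)  # strongest bin sits near 440 Hz
```

Everything after this stage operates on that time-frequency grid rather than on the raw waveform.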
robbyking 8 hours ago | parent | prev | next [–]
The first time I heard of Shazam was on a road trip with a friend of mine who had minimal tech skills at best. I was already 10 years into my career as an engineer, and when he told me about it, I honestly didn't believe him; I was positive he was mistaken, and speculated it was a service similar to Aardvark[1], which was a peer-to-peer information engine.
I was wrong, of course: Shazam really did live up to its hype. I think it's interesting that the more someone knows about how a technology works, the more sceptical they are of what it's capable of.
Don't want to spoil it for you if you really don't want to know, but I want to share with others in case they do, because I found it so interesting when I first learned!
It looks like others shared the paper: https://www.ee.columbia.edu/~dpwe/papers/Wang03-shazam.pdf
It's short but very cool. I read it a while ago and honestly can't pretend I fully grokked everything, but my understanding was that you can't just use a Fourier transform alone. Noise would basically make this impossible.
So what I'd consider the key insight is that they compressed songs down to "fingerprints". IIRC they noticed that certain bits of information in a song are preserved even in noisy environments. In particular, they could look at the spectrogram and see peaks of amplitude. They essentially set some radius and scanned the spectrogram; in a given radius, only the largest amplitude value in time and frequency would be preserved. So you've reduced a 3MB song to a handful of bits.
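That radius-based peak picking can be sketched roughly like this (the radius, threshold, and toy spectrogram are made-up illustrations; the paper tunes these carefully):

```python
import numpy as np

def constellation(spec, radius=3, threshold=0.0):
    """Keep only the (time, freq) points that are the maximum within
    a (2*radius+1)-sized neighbourhood of the spectrogram."""
    peaks = []
    T, F = spec.shape
    for t in range(T):
        for f in range(F):
            v = spec[t, f]
            if v <= threshold:
                continue
            patch = spec[max(0, t - radius):t + radius + 1,
                         max(0, f - radius):f + radius + 1]
            if v >= patch.max():
                peaks.append((t, f))
    return peaks

# a toy spectrogram with two isolated peaks
spec = np.zeros((10, 10))
spec[2, 3] = 5.0
spec[7, 8] = 4.0
print(constellation(spec))  # → [(2, 3), (7, 8)]
```

The surviving points are the "constellation map" the paper builds hashes from.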
This would be good enough for small databases (I think), but it's intractable for anything practical. So they built hashes out of these fingerprints using pairs of the preserved peaks. They would choose a certain peak (called the anchor point), record its time offset from the start of the song, and then form pairs with other nearby peaks, saving the pairs of frequencies (but discarding, e.g., their amplitudes). So for each of these anchor points you would get a 64-bit value: 32 bits for the time offset and track ID, and 32 bits for the frequency pair.
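The anchor-pairing scheme might look something like this. The bit-packing layout and fan-out limits below are illustrative assumptions, not the paper's exact format:

```python
def fingerprint_hashes(peaks, track_id, fan_out=5, max_dt=64):
    """Pair each anchor peak with a few later peaks; pack the two
    frequencies and the time delta into one 32-bit hash, and keep
    (track_id, anchor_time) as the 32-bit payload."""
    hashes = []
    peaks = sorted(peaks)  # order by time
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            dt = t2 - t1
            if 0 < dt < max_dt:
                # assumed packing: 12 bits per frequency, 8 bits delta
                h = (f1 << 20) | (f2 << 8) | dt
                hashes.append((h, (track_id, t1)))
    return hashes

# two peaks 3 time-steps apart in track 7
print(fingerprint_hashes([(0, 100), (3, 200)], track_id=7))
```

Pairing makes each hash far more specific than a single peak, which is what makes the database lookup fast.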
When you wanted to look up a song, they would fingerprint your snippet into multiple 32-bit hashes and compare them against the frequency-pair hashes in the database. If a song was a good match, your snippet would match against multiple hashes from that song, and specifically the matches would line up linearly over time (I'm struggling to explain this bit, but it's visually obvious if you look at Figure 3 in the paper).
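The "line up linearly over time" part can be sketched as a vote over time offsets: a true match piles many votes onto a single (track, offset) pair, which is the diagonal in the paper's Figure 3. The data here is a toy stand-in, not real hashes:

```python
from collections import Counter

def best_match(db, query_hashes):
    """db maps hash -> list of (track_id, track_time).
    Count, per track, how often database time minus query time
    lands on the same constant offset; a real match dominates."""
    offsets = Counter()
    for h, q_time in query_hashes:
        for track_id, t_time in db.get(h, []):
            offsets[(track_id, t_time - q_time)] += 1
    if not offsets:
        return None
    (track_id, offset), score = offsets.most_common(1)[0]
    return track_id, score

# toy database: track 1 contains hashes 10, 11, 12 at times 5, 6, 7
db = {10: [(1, 5)], 11: [(1, 6)], 12: [(1, 7)], 99: [(2, 0)]}
# the snippet heard those hashes at times 0, 1, 2: offset 5 every time
print(best_match(db, [(10, 0), (11, 1), (12, 2)]))  # → (1, 3)
```

Random collisions scatter their votes over many offsets, so they never accumulate a high score.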
I probably got some of this wrong, but I hope it's a helpful summary of the paper. I remember struggling to understand parts of it, so please let me know if anything I said is egregiously wrong!
goldcd 9 hours ago | parent | prev | next [–]
I think all (so simple!) you have to do is parse all the tracks ever made and generate a sequence of snapshots of what the tune sounds like, plus the deltas. E.g., if it were notes (for simplicity), E,D,C,D,E,E,E,D,D,D,E,E,E is the start of "Mary Had a Little Lamb". Millions of tracks contain the note E. Many hundreds of thousands probably have the note D next, and as you work through the sequence, you prune down that list until you know what it is.

The bit that makes my mind hurt, though, is the data structure you put those sequences into to make them quickly searchable. Users can start recording at any point in the song, so you can't just prune a tree down from a known starting point. There's also going to be background noise, so you need some way, "when you have no choice left", of sticking wild-cards into the previous decisions to see if you end up back on a known track.
Yeah - I think it's magic as well.
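One way around the "any starting point" problem is to index every k-length window of every song rather than pruning from the beginning; a snippet taken from anywhere then maps straight to its candidate positions. A toy sketch, using the notes-as-letters simplification from the comment above (purely illustrative, not how Shazam actually represents audio):

```python
def build_index(songs, k=4):
    """Map every k-note window of every song to its (title, position)
    occurrences, so lookups work from any starting point."""
    index = {}
    for title, notes in songs.items():
        for i in range(len(notes) - k + 1):
            index.setdefault(tuple(notes[i:i + k]), []).append((title, i))
    return index

songs = {"Mary": list("EDCDEEEDDDEEE")}
index = build_index(songs)
# a snippet heard from the middle of the song still matches
print(index[tuple("DEEE")])  # → [('Mary', 3), ('Mary', 9)]
```

This trades memory for lookup speed, which is essentially the trade Shazam's hash database makes too.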
Other thoughts: I used it back in the UK when it launched, and the first track I ever used it on dialling (2580 - the numbers down the middle of your keypad) was also a French track (MC Solaar – La Vie Est Belle)
I always felt they missed a trick in just identifying music (and then trying to sell you stuff). Surely they could have used the same tech to seamlessly mix all music together (i.e. take the sequences within tracks they find hard to differentiate, and use those points to mix two tracks together). What's the minimum number of tracks it would take, say, to seamlessly mix from Megadeth to Mozart?
senko 8 hours ago | root | parent | next [–]
We use AcoustID in MusicBox[0] to identify and deduplicate content, and it works great for us.
What we do is calculate the acoustic fingerprint of every uploaded track and compare/check for duplicates (only authorized staff can upload, but this still helps a bunch with user errors and in cases where you need to reupload a track). Then we compare the fingerprints using this[1] approach, so we can fine-tune the similarity based on our needs.
In our case it's been very effective. Yes, live versions are treated as different ones (which is exactly what we need in our case, so it's a feature for us), but mechanical differences between tracks (volume, slight distortions from codec, different compression levels or remasters, or track being cut differently) are just ignored.
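For anyone curious, the fingerprint comparison in the linked approach boils down to counting bit errors between two fingerprints. A simplified pure-Python sketch, assuming each fingerprint is a list of 32-bit integers (Chromaprint-style); this is an illustration, not MusicBox's actual code:

```python
def similarity(fp1, fp2):
    """Fraction of identical bits between two fingerprints,
    each a list of 32-bit integers. 1.0 means identical audio."""
    n = min(len(fp1), len(fp2))
    if n == 0:
        return 0.0
    # XOR leaves a 1 exactly where the fingerprints disagree
    diff_bits = sum(bin(a ^ b).count("1") for a, b in zip(fp1, fp2))
    return 1.0 - diff_bits / (32 * n)

a = [0xFFFF0000, 0x12345678]
b = [0xFFFF0001, 0x12345678]  # one bit different
print(similarity(a, b))  # → 0.984375
```

In practice you would also try small alignment shifts between the two fingerprints and pick a similarity threshold that matches your tolerance for remasters and codec artifacts.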
If you ever want/need audio fingerprinting, I can warmly recommend it.
[0] Music streaming service optimized for cafes, restaurants and other venues - https://musicbox.com.hr/ [1] https://groups.google.com/forum/#!msg/acoustid/Uq_ASjaq3bw/k...
jefftk 7 hours ago | root | parent | next [–]
> live versions are treated as different ones
I think you're talking about a live recording vs a studio recording? But what I think zelos was talking about was "someone is currently playing music live, what is it?", which is a lot harder because you need to recognize the essence of a song and not the essence of a recording of a song.
nibbleshifter 4 hours ago | root | parent | prev | next [–]
> Surely they could have used the same tech to seamlessly mix all music together. (i.e. take the sequences within tracks they find hard to differentiate, and then use these points to allow two tracks to be mixed together). What's the minimum number of tracks it would say take to seamlessly mix from Megadeth to Mozart?
I noodled around with this idea in my free time a few years ago, got absolutely nowhere really usable with it (I probably put in a couple hundred hours).
I knew I was limited by my dataset (small), code quality (terrible) and understanding of musical theory (virtually nil).
Maybe I'll pick up that idea again - even doing beat matching would be kind of neat.
bambataa 2 hours ago | root | parent | prev | next [–]
Shazam as a product feels a bit odd. Almost as if they’ve never quite outgrown their slightly sketchy “advertised on MTV2 alongside the Crazy Frog” origins.
They must have loads of data on songs people actually want to know about, yet they never really managed to turn themselves into anything more sophisticated.
saghm 8 hours ago | root | parent | prev | next [–]
My instinct is that it probably isn't as simple as you describe, because not only are there multiple notes at a time in a given track (i.e. chords), but there are also several tracks playing at once! It's possible that they're literally generating data like {guitar 1: C chord, guitar 2: single note E, bass: single note E} for every point in time, but even then each instrument isn't playing the exact same rhythm most of the time, so the notes won't exactly line up. I guess I don't think it's completely computationally infeasible to do it this way, but it seems more likely that they're just trying to separate the music from the background noise and then find the closest match to the music audio as a whole, rather than trying to separate it into components.
goldcd 1 hour ago | root | parent | next [–]
Sorry - I wasn't clear. I don't mean they're listening for notes. They're just analyzing the wave-form/fingerprint/whatever-you-want-to-call-it that's being generated at one moment, then the one from the next moment, then the next.
One of these might match random points in many songs, but a far smaller subset of these will have the same three in the same sequence.
It’s all just Fourier analysis I’m guessing?
Which I always find to be simultaneously simple and obvious as well as total magic.
nomilk 9 hours ago | prev | next [–]
> August 2002: Shazam launches as a text message service based in the UK. At the time, users could identify songs by dialing “2580” on their phone and holding it up as a song played. They were then sent an SMS message telling them the song title and the name of the artist.
Incredible! Curious to know what exactly happened backend after it listened to the audio, and what hardware it ran on.
cannam 7 hours ago | prev | next [–]
It's obviously a cracking algorithm, but what made Shazam doubly remarkable was how efficiently they turned it into a working product.
It wasn't just a case of developing an algorithm that could in theory be used to match an audio signal against all the world's pop songs. They presumably also had to get hold of a substantial number of those songs, fingerprint them, and roll out the search robustly against generally very poor audio hardware using simple telephony services at (for the time) quite considerable scale. They did it very quickly, it worked super well from launch, and it's been running continuously ever since.
I've read the paper about the method, but I would love to know more about the original development and deployment.
elboru 6 hours ago | prev | next [–]
Sometimes, I like to stop and think about all the amazing things that we can do with our phones and that we take for granted.
What I do is to imagine myself finding a smartphone in elementary school (90s kid). These are a few things that would blow my mind:
- Having a digital global map, with multitouch, that can show me where I am on that map. I can search for anything and find reviews from virtually anywhere in the world. I can zoom in and see my actual house. I can use street view.
- I have access to any song I want.
- The phone can listen to a song and it can tell me the name of it (then I can listen to it again)
- I can play video games with much better graphics than my N64
- I can watch movies and TV in there.
- I can video call
I remember studying this paper as a student; it was completely amazing, a bit mysterious and yet not so difficult to understand at the same time.
And most of all: no ML involved! All hail the heuristics!
ksala_ 7 hours ago | prev | next [–]
Shazam always blows my mind. It doesn't work 100% of the time, but when it does it feels like magic. On top of that, they introduced (I don't know exactly when) the feature to see lyrics for the song, automatically synched with the music. This is also mind-blowing.
Only Google has managed to top Shazam in blowing my mind, and only ~recently, by making this whole process happen completely offline and continuously in the background on a phone. It's not as broad but still incredible. Google's paper: https://arxiv.org/abs/1711.10958
msoad 10 hours ago | prev | next [–]
Shazam loads so freaking fast and is ready to listen on my iPhone that I really want to read an article on how they did it. It loads as fast as an empty hello-world app, but the button is ready to press and listen!
cannam 6 hours ago | root | parent | next [–]
That's interesting - I had a vague recollection of having heard of them before launch - I guess they were hiring from the pool of developers being laid off from the dotcom bust?
I have an image in my mind of my boss at the time going around the office asking if anyone was interested in talking to this thing called Shazam. I've long wondered if I imagined it. I certainly didn't act on it.
I remember (not much later than this) interviewing at a place where the product was intended to be "an automated assistant that listens to your phone call and pipes supporting information to your computer as you speak". Obviously I gave them a wide berth. It's funny to think about the "gap" in magic - Shazam seems magical but totally worked, this other idea seemed magical and, at the time, totally was.
rwmj 6 hours ago | root | parent | next [–]
I checked my email and the interview was actually in mid March 2002, not 2000/2001. I think still just before they did the initial launch of the premium phone service. Here's the job spec:
> Role: Senior software engineer - Low Level Device, Distributed Communications
> Role mission: To ensure that Shazam's subsystems are integrated and interface effectively and efficiently with external partners' systems/hosting environments, yielding available, robust and scalable full offerings.
> Key Performance Areas:
> 1. Design real time software using standard techniques and protocols, to be scalable, maintainable and robust
> 2. Manage & collaborate within and between team(s)
> 3. Implement quality software solutions within budget
> 4. Ensures that design and implementation of software is of high quality
> 5. Ensures that all deliverables are documented
> Required Skills/Capabilities:
> - Knowledge of interfacing peripherals and devices to Linux
> - Knowledge of Linux device drivers a plus
> - Distributed messaging techniques and protocols, eg: PVM, MPI
> - Ability to grasp and work with abstract concepts
> - Familiar with current software engineering methodologies e.g. RUP, XP
> - Understands and is able to manage quality assurance e.g., module tests, code review
> Required Knowledge/Previous Key Experience:
> - At least 4 years of full-time software engineering within a team of at least 3 software engineers
> - Must have been involved in all phases of the software cycle from requirements engineering to launch
> - Must have developed low level device or communications software
> - Experience with computer telephony a big plus
> - Experience with a high-growth startup environment a plus
> Ideal Qualifications: Ideally University degree in Computer Science (alternatively at least 4 years of proven software engineering experience).
> Please forward your CV/resume, with cover e-mail, including full details of your earnings expectations, to recruit@shazamteam.com
spinningarrow 7 hours ago | prev | next [–]
Not Shazam but I remember a website back in the day called ‘The Song Tapper’ where you could press your space bar to the rhythm of a song in your head and it would suggest which song(s) it might be. Teenage me thought that was very cool.
The site is no more, but I found a Lifehacker post about it: https://lifehacker.com/find-the-name-of-a-song-by-tapping-14...
magicalhippo 8 hours ago | prev | next [–]
Reminds me of a fun IRC moment 20 years ago or so. A buddy had a song stuck in his head, but he couldn't recall the name of it.
I asked how it went, and he typed something like "du du duu duu du du duu du, du du duu duu du du duu du" and within 10 seconds I replied "oh, Tom's diner by Suzanne Vega?" After a few moments he replied "yes! how the hell?!"
Anyway, Shazam is great when out and about and I hear something I like. Clubs and other loud venues provide a challenge, but covering the mic usually does the trick.
I'd love to read some more details about how such fingerprinting works. I'm sure there are lots of interesting details on how it deals with recording noise and such.
TacticalCoder 8 hours ago | parent | next [–]
> I'm sure there are lots of interesting details on how it deals with recording noise and such.
There's more to Shazam than that, but Fourier transforms get rid of the noise. I ported an FFT to Java back in the day and it was, IIRC, not even 100 lines of code. Amazing algorithm. I used it to record engine noise under acceleration and then derive the power/torque curve of my car (taking into account the number of cylinders): drive the car several times, both ways, on a street, and record the noise. Apply the FFT. Input the rim size / gear ratios etc. And I'd end up with about the exact same plot as the official one from the car manufacturer.
Noise simply disappears with an FFT.
A more concerning issue is harmonics.
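The RPM trick boils down to finding the dominant firing frequency in the spectrum and scaling it by the engine geometry. A rough sketch, assuming a 4-stroke engine (which fires cylinders/2 times per revolution); the parameters and toy signal are illustrative, not the original Java code:

```python
import numpy as np

def rpm_from_audio(samples, sample_rate, cylinders=4):
    """Find the dominant frequency in an engine recording and
    convert it to RPM: firing_freq = rpm / 60 * cylinders / 2."""
    spectrum = np.abs(np.fft.rfft(samples * np.hanning(len(samples))))
    freqs = np.fft.rfftfreq(len(samples), 1 / sample_rate)
    firing_freq = freqs[spectrum.argmax()]
    return firing_freq * 60 * 2 / cylinders

# toy input: a 100 Hz firing tone, i.e. 3000 RPM for a 4-cylinder
sr = 8000
t = np.arange(sr) / sr
print(round(rpm_from_audio(np.sin(2 * np.pi * 100 * t), sr)))  # → 3000
```

The harmonics mentioned above are exactly why real recordings need more care: the second or third harmonic can be louder than the firing fundamental, so a naive argmax can land on a multiple of the true frequency.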
raamdev 9 hours ago | prev | next [–]
In the mid-nineties, around the time I had just become a teenager, I remember walking down the back corridor of a mall where my parents were leasing a space for their business and hearing a song playing overhead on the mall speakers that really caught my attention. I had no idea what the song was called or who made it, but I really liked it. I remember wishing I had some way to quickly find out, before the song ended, the name of the song and the artist. I remember thinking, "wouldn't it be great if this cell phone in my pocket could somehow tell me the name of this song?"
A decade later I discovered Shazam, and even today, more than a decade after that, Shazam still has a place on my home screen, quickly within reach, helping me discover hundreds of great artists and songs overheard from as many different places. The magic of the experience, and the appreciation for the technology, stem from the memory of that moment in the mid-nineties when I stood under a speaker listening to a song that I might never hear again.
Google's now playing feature is somehow always offline (to relieve privacy concerns) and is somehow still incredible at recognizing even obscure songs. Really impressive.
I also love that it just shows up on my lock screen.
tialaramex 1 hour ago | root | parent | next [–]
Supposedly while building the backend, they realised the actual summary data for a reasonable breadth of tracks (say, anything you'd likely hear on the radio or on a jukebox) was tiny, and so: why build a service at all when you can just ship the data to phones?
Recently, for whatever reason, I was listening to the twist/cover "Stacy's Dad", and Now Playing recognised it as the rather more famous original, Fountains of Wayne's "Stacy's Mom". So yeah, it doesn't know everything. It also doesn't recognise lots of obscure stuff I own, like B-sides or special editions that never saw radio play, or bands that my friends were in (but everybody I know has read both Steve Albini's "Some of your friends are probably already this fucked" and the KLF's "The Manual", so none of them signed a recording contract and thus you've never heard of them). But I've never had a situation where I heard something I liked at a bar or somewhere and Now Playing didn't know what it was.
wincy 9 hours ago | parent | prev | next [–]
Heck yeah! If you’ve got an iPhone you literally just have to say “Hey Siri, what’s this song” and it’ll start listening and give you the Apple Music link. The only indication it’s Shazam is a little understated badge at the bottom.