How Shazam's Music Recognition System Design Actually Works

Music recognition system design sits at a nice intersection of signal processing and large-scale data systems. Apps like Shazam feel magical when they identify a song from a noisy five-second clip, but underneath is a tight pipeline: transform raw audio into a compact fingerprint, search that fingerprint efficiently, and map it back to human-friendly metadata.


From raw audio to fingerprints

Shazam does not store raw audio for every track. That would be expensive and slow to search. Instead it uses audio fingerprinting to convert each song into a compact digital summary that is fast to compare.

High-level pipeline

  1. Take a raw audio file.
  2. Clean it and convert it into a consistent digital format.
  3. Turn it into a spectrogram (time on x-axis, frequency on y-axis, intensity as amplitude).
  4. Detect distinctive peaks in that spectrogram.
  5. Hash those peaks into integers and store them as the song’s fingerprint.
  6. Keep human-readable metadata alongside the fingerprint.

sequenceDiagram
    participant User
    participant App as Shazam App
    participant FP as Fingerprinting Service
    participant NSQL as NoSQL Fingerprint DB
    participant SQL as SQL Metadata DB

    User->>App: Record short audio sample
    App->>FP: Send audio data
    FP->>FP: Preprocess audio (noise, normalize)
    FP->>FP: Spectrogram + peak detection
    FP->>FP: Generate fingerprint array
    FP->>NSQL: Query by fingerprint
    NSQL-->>FP: Matching fingerprint + songId
    FP->>SQL: Query metadata by songId
    SQL-->>FP: Title, artist, album, year
    FP-->>App: Match result
    App-->>User: Display identified song

This separation (fingerprint vs metadata) is key: the heavy signal processing happens once per track; most requests are just matching and lookup.


Audio preprocessing and feature extraction

Audio preprocessing

Before you can fingerprint, you need consistent input:

  • Noise reduction: suppress background sounds from cafes, cars, or chatter.
  • Volume normalization: treat quiet and loud recordings consistently.
  • Format conversion: convert to a standard digital format and sample rate.

The goal is to make different recordings of the same song look similar enough in the next steps.
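As a rough illustration of the volume normalization step, here is a minimal sketch of peak normalization on a list of float samples. The function name `peak_normalize` and the `target_peak` parameter are hypothetical; the article does not say which normalization method Shazam uses.

```python
def peak_normalize(samples, target_peak=1.0):
    """Scale samples so the largest absolute value equals target_peak.

    Assumption: samples are floats (for example in the range -1.0 to 1.0).
    """
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silence or empty input: nothing to scale
    gain = target_peak / peak
    return [s * gain for s in samples]


# Two recordings of the same shape at very different volumes.
quiet = [0.01, -0.02, 0.015, -0.005]
loud = [0.5, -1.0, 0.75, -0.25]

print(peak_normalize(quiet))
print(peak_normalize(loud))
```

After scaling, the quiet and loud versions are numerically almost identical, which is what the later steps need.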

Spectrogram and peak detection

Shazam computes a spectrogram:

Axis / property    Meaning
x-axis             Time
y-axis             Frequency
intensity          How strong a frequency is at a specific time slice

From that spectrogram it picks peaks: time–frequency pairs where the energy is strong and distinctive. Example:

(t = 5s,  f = 500 Hz)
(t = 10s, f = 1000 Hz)
(t = 15s, f = 1500 Hz)
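To make the spectrogram and peak-picking idea concrete, the sketch below computes a naive discrete Fourier transform per frame and keeps the strongest frequency bin in each frame. It is an illustration only: the function names, the frame size, and the "one peak per frame" rule are assumptions, and a real system would use an FFT library with windowing and overlapping frames.

```python
import cmath
import math


def dft_magnitudes(frame):
    """Magnitude of each frequency bin from 0 up to Nyquist (naive O(n^2) DFT)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(acc))
    return mags


def strongest_peaks(samples, sample_rate, frame_size):
    """Return one (time_seconds, frequency_hz) peak per non-overlapping frame."""
    peaks = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        mags = dft_magnitudes(samples[start:start + frame_size])
        k = max(range(len(mags)), key=mags.__getitem__)  # index of the loudest bin
        peaks.append((start / sample_rate, k * sample_rate / frame_size))
    return peaks


SAMPLE_RATE = 8000   # assumed sample rate in Hz
FRAME_SIZE = 64      # assumed frame length; gives 125 Hz per bin

# A pure 500 Hz tone, four frames long.
tone = [math.sin(2 * math.pi * 500 * i / SAMPLE_RATE) for i in range(4 * FRAME_SIZE)]
for t, f in strongest_peaks(tone, SAMPLE_RATE, FRAME_SIZE):
    print(f"t = {t:.3f}s, f = {f:.0f} Hz")
```

For the pure tone, every frame reports a peak at 500 Hz. Real music produces many peaks per frame, and choosing which ones count as "distinctive" is where most of the tuning goes.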

Those peaks are passed into a hash function that combines time and frequency into a single integer. In this toy illustration, the hash simply adds the time in seconds to the frequency in Hz:

(5s, 500 Hz)   → 505
(10s, 1000 Hz) → 1010
(15s, 1500 Hz) → 1515

The fingerprint becomes a short array, e.g.:

[505, 1010, 1515]

This representation is small, robust to noise, and easy to index.
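The integers above are toy values. One common variation in landmark-style fingerprinting is to hash pairs of peaks (an anchor frequency, a nearby target frequency, and the time gap between them) so the hash does not depend on where in the song the clip starts. The sketch below packs such a triple into one integer. The bit widths and function names are assumptions for illustration, and this article does not confirm that Shazam's production system uses this exact layout.

```python
FREQ_BITS = 10   # assumed: frequency bin indices fit in 0..1023
DELTA_BITS = 10  # assumed: time gaps fit in 0..1023 frames


def pack_pair(anchor_bin, target_bin, delta_frames):
    """Pack (anchor frequency bin, target frequency bin, time gap) into one int."""
    fields = ((anchor_bin, FREQ_BITS), (target_bin, FREQ_BITS), (delta_frames, DELTA_BITS))
    for value, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"{value} does not fit in {bits} bits")
    return (anchor_bin << (FREQ_BITS + DELTA_BITS)) | (target_bin << DELTA_BITS) | delta_frames


def unpack_pair(h):
    """Inverse of pack_pair, useful for debugging stored hashes."""
    freq_mask = (1 << FREQ_BITS) - 1
    delta_mask = (1 << DELTA_BITS) - 1
    anchor = (h >> (FREQ_BITS + DELTA_BITS)) & freq_mask
    target = (h >> DELTA_BITS) & freq_mask
    return anchor, target, h & delta_mask


h = pack_pair(anchor_bin=4, target_bin=8, delta_frames=3)
print(h, unpack_pair(h))
```

Because only the gap between the two peaks is stored, the same pair produces the same integer whether it is seen in the full track or in a clip recorded halfway through.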

Figure: Waveform (left) and spectrogram peak plot (right) for a track before its peaks are turned into an audio fingerprint. The peak plot shows time on the x-axis and frequency on the y-axis.

Data model: NoSQL for fingerprints, SQL for metadata

Shazam splits storage into two different systems, each optimized for its job.

NoSQL for unstructured fingerprints

Fingerprints are unstructured numeric arrays. They live in a NoSQL database such as MongoDB. The fingerprint field acts as a partition key or query key. Example document:

{
  "id": "60d5ec9af682fbd12a892c99",
  "fingerprint": [500, 1000, 1500],
  "songId": "b7c9e2d7-5bfa-4cba-9e60-b6e1a4166e71"
}

The store is tuned for high write throughput and fast read queries on these fingerprints.

SQL for structured metadata

Song metadata is structured and fits well in a relational model. It lives in a SQL database such as MySQL. songId is the primary key. Example row from a Songs table:

Column    Value
songId    b7c9e2d7-5bfa-4cba-9e60-b6e1a4166e71
title     Bohemian Rhapsody
artist    Queen
album     A Night at the Opera
year      1975

Joins are simple: one table for songs, others for albums, artists, or licensing. This split lets each database do what it is good at: NoSQL for messy, scalable fingerprint search; SQL for clean relational metadata.


Query path when you tap the Shazam button

When a user wants to identify a track, the flow mirrors the ingest pipeline.

Step 1: Capture and fingerprint

  • The app records a short audio clip.
  • The backend runs: preprocessing → spectrogram generation → peak detection → fingerprint generation.

Example: the new sample yields a sequence of fingerprint arrays:

[500, 1000, 1500]
[505, 1005, 1505]
[510, 1010, 1510]

Step 2: Fingerprint lookup

The service queries the NoSQL database with the sample’s fingerprint values:

db.fingerprints.find({
  fingerprint: { $in: [500, 1000, 1500] }
})

That returns one or more candidate documents:

{
  "id": "60d5ec9af682fbd12a892c99",
  "fingerprint": [500, 1000, 1500],
  "songId": "b7c9e2d7-5bfa-4cba-9e60-b6e1a4166e71"
}

The important field is songId.
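The `$in` query above matches any document whose fingerprint array shares at least one value with the sample. A minimal in-memory stand-in for that behaviour, assuming an inverted index from fingerprint value to song IDs, might look like this. The document contents and the names `build_index` and `candidate_song_ids` are hypothetical.

```python
# Hypothetical documents shaped like the example above.
FINGERPRINT_DOCS = [
    {"id": "doc-1", "fingerprint": [500, 1000, 1500], "songId": "song-a"},
    {"id": "doc-2", "fingerprint": [700, 1400, 2100], "songId": "song-b"},
]


def build_index(docs):
    """Map each fingerprint value to the set of songIds that contain it."""
    index = {}
    for doc in docs:
        for value in doc["fingerprint"]:
            index.setdefault(value, set()).add(doc["songId"])
    return index


def candidate_song_ids(index, sample_values):
    """Return every songId that shares at least one value with the sample."""
    found = set()
    for value in sample_values:
        found |= index.get(value, set())
    return found


index = build_index(FINGERPRINT_DOCS)
print(candidate_song_ids(index, [500, 1000, 1500]))
```

This only produces candidates. Deciding which candidate is the real match needs a scoring step on top.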

Step 3: Metadata lookup

Using that songId, the service queries the SQL database:

SELECT *
FROM Songs
WHERE songId = 'b7c9e2d7-5bfa-4cba-9e60-b6e1a4166e71';

That returns the human-friendly metadata, which is sent back to the app and shown to the user.
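From application code, the same lookup is usually issued as a parameterized query rather than by pasting the ID into the SQL string. The sketch below uses Python's built-in `sqlite3` as a stand-in for MySQL so it runs anywhere; the schema is inferred from the example row, and `fetch_metadata` is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory stand-in for the metadata DB
conn.execute(
    """
    CREATE TABLE Songs (
        songId TEXT PRIMARY KEY,
        title  TEXT NOT NULL,
        artist TEXT NOT NULL,
        album  TEXT,
        year   INTEGER
    )
    """
)
conn.execute(
    "INSERT INTO Songs VALUES (?, ?, ?, ?, ?)",
    ("b7c9e2d7-5bfa-4cba-9e60-b6e1a4166e71", "Bohemian Rhapsody", "Queen",
     "A Night at the Opera", 1975),
)


def fetch_metadata(conn, song_id):
    """Return the song's metadata as a dict, or None if the ID is unknown."""
    row = conn.execute(
        "SELECT title, artist, album, year FROM Songs WHERE songId = ?",
        (song_id,),  # bound parameter, never string concatenation
    ).fetchone()
    if row is None:
        return None
    return dict(zip(("title", "artist", "album", "year"), row))


print(fetch_metadata(conn, "b7c9e2d7-5bfa-4cba-9e60-b6e1a4166e71"))
```

Naming the columns instead of using `SELECT *` keeps the response shape stable if the table later gains columns.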


Matching with partial fingerprints

Real users often record only a few seconds, and the environment is noisy. Shazam’s design accepts that:

  • It does pattern matching on sequences of fingerprints.
  • Even if a sample contains fewer fingerprints than the full track, the system can match based on overlapping subsequences.
  • A sequence like the three arrays above can still be matched inside a much longer fingerprint sequence for the full song.

This tolerance is what makes the system robust when only a short, imperfect clip is available.
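One way to match a short clip inside a longer track is to let every shared hash vote for a (song, time shift) pair and pick the pair with the most votes. The sketch below assumes one hash per frame and uses made-up hash values; the names `build_offset_index`, `best_match`, and the `min_votes` threshold are hypothetical, not taken from Shazam.

```python
from collections import Counter


def build_offset_index(tracks):
    """tracks maps songId -> list of hashes, one per frame, in track order."""
    index = {}
    for song_id, hashes in tracks.items():
        for offset, h in enumerate(hashes):
            index.setdefault(h, []).append((song_id, offset))
    return index


def best_match(index, sample_hashes, min_votes=2):
    """Return (songId, shift, votes) for the best-aligned song, or None."""
    votes = Counter()
    for sample_offset, h in enumerate(sample_hashes):
        for song_id, track_offset in index.get(h, []):
            # Hashes from the true song agree on one consistent time shift.
            votes[(song_id, track_offset - sample_offset)] += 1
    if not votes:
        return None
    (song_id, shift), count = votes.most_common(1)[0]
    return (song_id, shift, count) if count >= min_votes else None


TRACKS = {
    "song-a": [11, 42, 57, 90, 13, 77, 64],
    "song-b": [90, 25, 13, 38, 42, 19, 57],
}
index = build_offset_index(TRACKS)

# A clip starting at frame 3 of song-a, with one frame corrupted by noise (999).
sample = [90, 13, 999, 64]
print(best_match(index, sample))  # ('song-a', 3, 3)
```

Both songs contain the hashes 90 and 13, but only song-a has them at offsets consistent with a single shift, so the noisy frame and the coincidental overlaps with song-b do not change the result.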


Scaling music fingerprinting

At scale, Shazam has to handle millions of tracks and continuous user queries. Strategies:

Parallel processing

  • Feature extraction and fingerprint generation are embarrassingly parallel.
  • Distributed workers can process different songs or segments at the same time.
  • This accelerates ingest when the catalog grows or when re-fingerprinting is needed.
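Because each track is fingerprinted independently, ingest can be spread over a worker pool. The sketch below uses a thread pool from the standard library; `fingerprint_track` is a placeholder for the real preprocessing and hashing pipeline, and the input data is invented. For CPU-bound work in Python, a process pool or separate worker machines would be the more likely choice.

```python
from concurrent.futures import ThreadPoolExecutor


def fingerprint_track(track):
    """Placeholder for the real pipeline: keeps the rounded peak frequencies."""
    song_id, peaks = track
    return song_id, [round(freq) for _time, freq in peaks]


def fingerprint_catalog(tracks, max_workers=4):
    """Fingerprint every track in parallel and return {songId: fingerprint}."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order and re-raises any worker exception here.
        return dict(pool.map(fingerprint_track, tracks))


catalog = [
    ("song-a", [(5.0, 500.2), (10.0, 999.8)]),
    ("song-b", [(2.0, 700.4), (4.0, 1400.0)]),
]
print(fingerprint_catalog(catalog))
```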

Caching

  • Caches reduce repeated database hits for hot songs.
  • Frequently requested songId → metadata lookups can be served from in-memory caches like Redis.
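The caching bullet describes a cache-aside pattern: check the cache first, fall back to the database on a miss, then remember the answer. A minimal sketch follows, with a plain dictionary standing in for Redis and a hypothetical `loader` function standing in for the SQL query. It has no eviction or expiry, which a real cache would need.

```python
class MetadataCache:
    """Cache-aside wrapper around a slower songId -> metadata lookup."""

    def __init__(self, loader):
        self._loader = loader  # called only on a cache miss
        self._store = {}       # stand-in for an external cache such as Redis
        self.hits = 0
        self.misses = 0

    def get(self, song_id):
        if song_id in self._store:
            self.hits += 1
            return self._store[song_id]
        self.misses += 1
        value = self._loader(song_id)
        if value is not None:  # do not cache unknown IDs in this sketch
            self._store[song_id] = value
        return value


# Hypothetical rows standing in for the SQL Songs table.
FAKE_SQL_ROWS = {"song-a": {"title": "Example Title", "artist": "Example Artist"}}

cache = MetadataCache(FAKE_SQL_ROWS.get)
cache.get("song-a")  # miss: goes to the "database"
cache.get("song-a")  # hit: served from memory
print(cache.hits, cache.misses)  # 1 1
```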

Partitioning and indexing

  • Partitioning splits fingerprint data across shards so queries can fan out and run in parallel.
  • Indexes on fingerprint values and sequences make lookups fast even with large catalogs.
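A simple way to picture partitioning is to route each hash to a shard by taking it modulo the shard count, then group a query's hashes by shard before looking them up. The sketch below keeps the shards as in-memory dictionaries; the shard count, routing rule, and function names are assumptions, and a real deployment would query the shards concurrently over the network.

```python
NUM_SHARDS = 4  # assumed shard count


def shard_for(hash_value, num_shards=NUM_SHARDS):
    """Route a fingerprint hash to a shard index."""
    return hash_value % num_shards


def build_shards(entries, num_shards=NUM_SHARDS):
    """entries is an iterable of (hash_value, songId) pairs."""
    shards = [{} for _ in range(num_shards)]
    for h, song_id in entries:
        shards[shard_for(h, num_shards)].setdefault(h, set()).add(song_id)
    return shards


def lookup(shards, sample_hashes):
    """Group hashes by shard, query each shard once, and merge the results."""
    by_shard = {}
    for h in sample_hashes:
        by_shard.setdefault(shard_for(h, len(shards)), []).append(h)
    found = set()
    for shard_id, hashes in by_shard.items():  # each iteration could run in parallel
        for h in hashes:
            found |= shards[shard_id].get(h, set())
    return found


shards = build_shards([(500, "song-a"), (1000, "song-a"), (1505, "song-b")])
print(lookup(shards, [500, 1505]))
```

Modulo routing is easy to reason about but reshuffles most keys when the shard count changes, which is why larger systems often use consistent hashing instead.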

Together these patterns let music recognition stay fast as the catalog and user base expand.


Concept map of the architecture

graph TD
    A[Audio file] --> B[Preprocess audio]
    B --> C[Generate spectrogram]
    C --> D[Detect peaks]
    D --> E[Generate fingerprint array]
    E --> F[Store in NoSQL fingerprints DB]
    E --> G[Map to songId]
    G --> H[Store metadata in SQL Songs table]

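One pipeline builds the fingerprints and metadata. At query time, the same steps up to E run on the user sample; the system then looks up the resulting fingerprint in the NoSQL store (F) to get a songId and uses that songId to fetch the matching row from the SQL Songs table (H).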


Key takeaways

  • Audio fingerprinting converts songs into compact, peak-based fingerprints that are robust to noise and short samples.
  • Spectrograms and peak hashing turn time and frequency information into integers that are easy to store and index.
  • NoSQL is used for fingerprints because it handles large, unstructured arrays and high-throughput queries well.
  • SQL is used for metadata since song data is structured and benefits from relational modeling and primary keys.
  • Partial sequence matching lets the system identify songs from only a few seconds of audio.
  • Scaling comes from parallel processing, caching, partitioning, and indexing, which together keep lookup times low as the catalog grows.
