This post shows how to fit an arbitrary number of Lorentzian peaks (Cauchy distributions) using scipy.
import inspect
import itertools
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats, signal, optimize
from sklearn.metrics import r2_score
Model
The key to fitting an arbitrary number of peaks is to design the model signature properly. First we define the peak model as we usually would when using curve_fit.
def peak(x, x0, s, A):
    # Single downward Lorentzian (Cauchy) peak centered at x0 with scale s, scaled by A
    return - A * stats.cauchy(loc=x0, scale=s).pdf(x)
Then we construct a multi-peak model taking a single flat vector of parameters, whose length must be a multiple of the number of per-peak parameters:
def model(x, *p):
    m = len(inspect.signature(peak).parameters) - 1  # parameters per peak (x excluded)
    n = len(p) // m                                   # number of peaks encoded in p
    y = np.zeros_like(x)
    for i in range(n):
        y += peak(x, *p[i*m:(i+1)*m])
    return y
Such a variadic signature will not suit curve_fit, which expects explicit parameters, but minimize handles it fine.
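As a quick sanity check of the variadic signature (the parameter values below are made up for illustration), each consecutive triplet in the flat vector is (x0, s, A) for one peak:
x_demo = np.linspace(0, 10, 11)
y_demo = model(x_demo, 5.0, 1.0, 2.0, 8.0, 0.5, 1.0)  # two peaks, centered at x = 5 and x = 8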
Dataset
Let's construct a synthetic dataset with 6 peaks:
x0s = [10, 25, 70, 100, 120, 140]
s0s = [1, 1.5, 2, 1, 0.5, 1]
A0s = [2, 3, 2, 3, 1, 1]
p0 = list(itertools.chain(*zip(x0s, s0s, A0s)))
# [10, 1, 2, 25, 1.5, 3, 70, 2, 2, 100, 1, 3, 120, 0.5, 1, 140, 1, 1]
np.random.seed(123456)
xlin = np.linspace(0, 150, 500)
y = model(xlin, *p0)
s = 0.005 * np.ones_like(xlin)
yn = y + s * np.random.normal(size=s.size)
Which renders as:
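A minimal matplotlib sketch to produce such a plot (labels and styling are arbitrary choices):
fig, ax = plt.subplots()
ax.scatter(xlin, yn, marker=".", label="Noisy data")
ax.plot(xlin, y, color="orange", label="True model")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()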
Peak identification
Now we need to automatically identify how many peaks we have and where they are located.
peaks, bases = signal.find_peaks(-yn, prominence=0.2)
# (array([ 33, 83, 233, 333, 399, 466]),
# {'prominences': array([0.63486587, 0.61350699, 0.30987885, 0.94338495, 0.63369006,
# 0.30995799]),
# 'left_bases': array([ 4, 51, 156, 156, 369, 434]),
# 'right_bases': array([156, 156, 287, 497, 497, 497])})
The peaks have been properly identified, but the base boundaries are a bit messy; we can clean them up:
def clean_base_indices(lefts, rights):
    # Make the base intervals of consecutive peaks contiguous and non-overlapping
    _lefts = np.copy(lefts)
    _rights = np.copy(rights)
    for i in range(len(lefts) - 1):
        if lefts[i + 1] < lefts[i]:
            _lefts[i + 1] = rights[i]
        if rights[i + 1] < rights[i]:
            _rights[i] = lefts[i + 1]
        if lefts[i] == lefts[i + 1]:
            _lefts[i + 1] = rights[i]
        if rights[i] == rights[i + 1]:
            _rights[i] = lefts[i + 1]
    return _lefts, _rights
lefts, rights = clean_base_indices(bases["left_bases"], bases["right_bases"])
# (array([ 4, 51, 156, 287, 369, 434]), array([ 51, 156, 287, 369, 434, 497]))
The result of identification looks like:
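One possible way to visualize the detected peaks and their cleaned base intervals (markers and shading are my own choices):
fig, ax = plt.subplots()
ax.plot(xlin, yn, label="Noisy data")
ax.scatter(xlin[peaks], yn[peaks], color="red", zorder=3, label="Detected peaks")
for left, right in zip(lefts, rights):
    ax.axvspan(xlin[left], xlin[right], alpha=0.1)  # shade each peak's base interval
ax.set_xlabel("x")
ax.legend()
plt.show()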
Fit parametrization
Using the peak identification data, we create an initial guess vector and bounds to help drive the minimization:
x0 = list(itertools.chain(*[[xlin[i], 0.1, 0.1] for i in peaks]))
bounds = list(itertools.chain(*[
[(xlin[lefts[i]], xlin[rights[i]]), (0., np.inf), (0., np.inf)] for i in range(len(peaks))
]))
We define the loss function, a chi-squared normalized by the number of degrees of freedom (reduced chi-squared):
def loss_factory(x, y, s):
    def wrapped(p):
        # Reduced chi-squared: weighted squared residuals divided by degrees of freedom
        return np.sum(np.power((y - model(x, *p)) / s, 2)) / (x.size - len(p))
    return wrapped
loss = loss_factory(xlin, yn, s)
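As a quick sanity check (my own addition), the loss evaluated at the true parameters should be close to unity, since the weights s match the injected noise level, while the rough initial guess scores much worse:
loss(p0)  # close to 1: residuals are consistent with the noise level
loss(x0)  # much larger: the initial guess is still far from the data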
And we perform the minimization, helped by our educated initial guess and bounds:
sol = optimize.minimize(loss, x0=x0, bounds=bounds)
# fun: 1.0955753119538643
# hess_inv: <18x18 LbfgsInvHessProduct with dtype=float64>
# jac: array([-2.73114841e-06, -4.70734565e-05, 6.31272816e-05, 1.57429612e-05,
# 3.73034939e-06, -2.41806576e-05, 1.29674131e-05, -5.59552408e-06,
# -2.22932785e-05, 1.63980044e-04, -1.13486998e-04, -5.48450177e-06,
# 6.19504837e-05, 6.61692919e-05, -3.19744233e-06, 2.42916605e-05,
# -2.20268249e-05, 1.53876912e-05])
# message: 'CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH'
# nfev: 836
# nit: 41
# njev: 44
# status: 0
# success: True
# x: array([ 10.00233549, 1.01832565, 2.02107376, 24.99880996,
# 1.50914703, 3.01233404, 69.98152044, 2.04653551,
# 2.01075276, 99.98876123, 1.00539617, 3.00563652,
# 119.99496644, 0.50185139, 1.00415731, 140.00815263,
# 1.02170953, 1.00824265])
Which is a fairly good optimization:
yhat = model(xlin, *sol.x)
score = r2_score(yn, yhat) # 0.9986663779143068
The final result looks like:
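A minimal sketch to plot the fitted model against the noisy data (labels and styling are my own choices):
fig, ax = plt.subplots()
ax.scatter(xlin, yn, marker=".", label="Noisy data")
ax.plot(xlin, yhat, color="orange", label=f"Fitted model (R2 = {score:.4f})")
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()
plt.show()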