Frasier, without the (recorded) laughs…
At a recent dinner party, some friends and I wondered: What would the TV show Frasier be without the audience laugh track? And further, could machine learning be used to automatically mute out the undesired audience audio?
So, I built a tool to automatically mute the sections of audience laughter.
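Conceptually, the muting step is simple: once a classifier has labeled which time spans are laughter, you zero out the corresponding sample ranges. A minimal numpy sketch (the `mute_segments` helper is hypothetical, not the repository's actual API):

```python
import numpy as np

def mute_segments(samples, sr, segments):
    """Zero out the audio between each (start_s, end_s) pair of seconds.

    `samples` is a 1-D array of audio samples at sample rate `sr`;
    `segments` is a list of spans a classifier flagged as laughter.
    """
    out = samples.copy()
    for start_s, end_s in segments:
        out[int(start_s * sr):int(end_s * sr)] = 0.0
    return out

sr = 8000
audio = np.ones(sr * 3, dtype=np.float32)            # 3 s of stand-in "audio"
muted = mute_segments(audio, sr, [(0.5, 1.0), (2.0, 2.5)])
```

A real tool would likely fade in and out at the segment boundaries rather than hard-cut to silence, but the mask-and-zero idea is the core of it.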
The repository includes a Jupyter notebook with a detailed explanation, an example video clip with processed audio, a trained model, and a CLI tool for applying all this to your own audio files.
I wrote this in Python and used several open source libraries, notably Keras for describing a multi-layer recurrent neural network, librosa for audio feature extraction, and (of course) numpy to manipulate the data. It’s amazing what you can do in under 300 lines of code, thanks to the contributions of the open source community.
What makes this an interesting problem, in my opinion, is that the “laugh track” isn’t canned audio but highly variable, real studio audience recordings, so properly classifying it required a more complex machine learning approach: a model that considers time-series data, rather than a single point in time.
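The time-series idea boils down to feeding the model short overlapping windows of consecutive feature frames, so each prediction sees a few frames of context instead of one isolated frame. A minimal numpy sketch of that windowing (the helper name is hypothetical):

```python
import numpy as np

def frames_to_sequences(features, window):
    """Stack consecutive per-frame feature vectors into overlapping
    windows of length `window`, so a recurrent model sees a run of
    frames in time rather than a single point-in-time feature vector.

    features: (n_frames, n_features) -> returns (n_windows, window, n_features)
    """
    n_frames, _ = features.shape
    return np.stack([features[i:i + window]
                     for i in range(n_frames - window + 1)])

feats = np.arange(20, dtype=np.float32).reshape(10, 2)  # 10 frames, 2 features
seqs = frames_to_sequences(feats, window=4)             # (7, 4, 2)
```

Arrays shaped `(n_windows, window, n_features)` are exactly what recurrent layers such as Keras's LSTM expect as input, which is what makes this framing a natural fit.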
I used a number of fantastic tools and resources along the way – starting with Jupyter to rapidly prototype different approaches to the problem, and visualize the data as I went along. After I had built a solution I was happy with, I worked to convert it from a notebook, which is good for reference, exploration, and demonstration, to a standalone Python tool with a command line interface, which can be run on-demand to train or apply the model.
In theory, this tool could be trained on any two classes of audio, and then used to mute out one of those two classes. I don’t believe there is anything particularly special about identifying “laughs” vs other kinds of distinct audio! All my code is released under MIT license, so you are welcome to do whatever you’d like with it.