Recently, I had the opportunity to visit our coffee roaster, Diantha's Coffee, a wholesale coffee roaster here in the San Francisco Bay Area, where I could see first-hand the coffee being roasted. They have an old-fashioned roaster with manual controls and considerable charm.
It turned out to be an interesting opportunity to apply machine-learning techniques to a novel process.
The Sound of Coffee Roasting
Roasting is in many ways a methodical process that takes skill and experience to get the right combination of bean mixtures, roasting temperatures, timing, and rapid cool-downs.
One part of the process involves determining whether the beans have been sufficiently roasted. Depending upon the type of bean and the roast desired, there are different cycles of roasting and cool down.
One technique that smaller roasters use to assess where the beans are in their cycle is to listen to the sound of the roasting.
As beans heat up, the moisture in the beans starts to exert pressure, until finally the bean cracks and expands. This activity creates a crackling sound that signals that the beans are nearing the end of their roasting cycle.
I wondered if I could apply machine-learning techniques to generate a signal that would be triggered by the sound of the crackling.
Although I have had some experience with neural networks in the past, I had recently taken the excellent on-line Stanford Machine Learning Class given by Andrew Ng, and I was primed to see the world through machine-learning eyes.
By the way, while I have attempted to adhere to careful, technical standards in performing the work, I deliberately avoid a standard academic-style write-up of my results with all the joy sucked out. I prefer that you can look over my shoulder as I pondered the problem, and we discover together what can be done.
Why Do This
Curiosity and fun are good reasons to do this.
I am also interested in ways that computers can interact with the real world. To justify this in a problem-solving way: large commercial roasters use timers along with expected temperature profiles for roasting beans. On average, the batches might be roasted correctly. However, if there is a distribution of moisture within specific types of beans, then some batches might not match the expected quality.
Smaller roasters might be interested, because, even if they normally monitor it manually, they could get busy and miss the signal. A device that could understand when the beans are done and automatically release the beans at the right time would be a benefit.
What follows is a description of an approach to solving the problem and the results. While the results are interesting, and potentially useful as an embedded device, this is about the journey of figuring out the puzzle.
My approach was to capture a recording of the roasting process, convert it to sound files that incorporated both unroasted and roasted (crackling) periods, and label the samples by listening for the crackling sound, which I will describe in the rest of my account as "roasted".
The purpose for the samples would be to create a set of data that would be used for supervised learning. Supervised learning in this case means that each data point is labeled as belonging to either an unroasted or roasted set. The object of the game is to derive a model that can successfully distinguish in which category any given data point belongs.
To that end, I would create six samples of sound. Three would come from the time prior to the crackling sound, and the other three would be taken from the crackling period. Then, the samples would be paired together back-to-back to make three sets of sounds.
One set would be the learning data, used to build the model. The next set would be the validation data, used to judge the efficacy of the model parameters. By adjusting the parameters, I would seek to maximize the accuracy of the model on both the learning data and the validation data. Running the model on the validation data gives a somewhat independent basis for deciding whether to believe your model. However, as the model is adjusted against both the learning and validation data, the usefulness of the validation data is progressively curtailed, because it becomes less of an independent source.
Therefore, when the model appears to function about as well as it can, it would be run against the test data, which to this point has not been used at all. It is this previously unused data that would suggest whether the model is truly sufficiently accurate.
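The three back-to-back sets described above can be sketched as follows. This is a minimal illustration in NumPy; the random arrays stand in for the actual sound features, and the shapes are illustrative rather than the actual sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the six sound samples: three from the
# unroasted period (label 0) and three from the crackling/roasted
# period (label 1).
unroasted = [rng.normal(size=(100, 512)) for _ in range(3)]
roasted = [rng.normal(size=(100, 512)) for _ in range(3)]

def pair(u, r):
    """Join one unroasted and one roasted sample back-to-back,
    labeling the points 0 (unroasted) and 1 (roasted)."""
    X = np.vstack([u, r])
    y = np.concatenate([np.zeros(len(u)), np.ones(len(r))])
    return X, y

# Three back-to-back sets: learning, validation, and test.
X_learn, y_learn = pair(unroasted[0], roasted[0])
X_valid, y_valid = pair(unroasted[1], roasted[1])
X_test, y_test = pair(unroasted[2], roasted[2])

print(X_learn.shape, y_learn.shape)  # (200, 512) (200,)
```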
I recorded the roasting process with my Android phone held an inch or two away from the roaster getting a sample that progresses from unroasted to crackling sounds.
Because I hadn't analyzed audio files before, I wasn't sure how best to approach the problem. The Android sound recorder records at 8000 Hz and produces a file with an .amr extension. I was not able to use the file directly once it got to my computer, but I found the utility sox, which converted it to a .au file. The .au filetype, according to Wikipedia, is a Sun format. Once the data was in a .au file, I could work with it directly in both Octave and Python.
Since the nature of sound involves frequency, or patterns of change over time, you can't tell anything about the sound from any single value without looking at the values around it to see the pattern. My choice was either to use recurrent neural networks or to reprocess the values over short periods of time into features. I opted to reprocess the values into frequency features.
Converting Sound to Features
What Features to Select
Five minutes of data at 8000 Hz represents about 2.4 million data points. While there is a lot of data, it is fairly uniform. The learning data consisted of 40,000 points, while the validation and test data were 30,000 points each.
One peculiarity of labeling sections of sound as crackling is that there are gaps in the sound; otherwise it would not crackle. Whatever process is used to identify this characteristic has to look back far enough in time to hear the crackle, but not so far that the features become too large to comfortably process. If, for some reason, there was a random gap in the crackling, then the model would not correctly signal that the roasting was almost over.
The following spectrogram shows a 300-second sample. Most of the chart represents the low rumble of the drum turning as the coffee roasts.
The crackling sound corresponds to the slightly lighter section on the right. I actually expected the crackling to be a little more dramatic in the spectrogram, but at least it is discernible.
The darker vertical streaks on the far right represent the coffee being released from the drum and pouring out the chute, which is a much louder crackling sound. Finally, the lighter region after the dark streaks is associated with the coffee cooling down.
Using a look-back period of 1024 (1024/8000=0.128 seconds), I built up new feature samples that were a Fast Fourier Transform (FFT) of 512 points.
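This transformation can be sketched as follows. The non-overlapping windowing and the use of raw FFT magnitudes are my assumptions here, not necessarily the exact processing used.

```python
import numpy as np

RATE = 8000          # Android recorder sample rate (Hz)
WINDOW = 1024        # look-back period: 1024 / 8000 = 0.128 seconds
N_FEATURES = 512     # magnitudes kept from each transform

def fft_features(signal, window=WINDOW, n_features=N_FEATURES):
    """Slice the signal into non-overlapping windows and return the
    FFT magnitudes of each window as one feature row."""
    rows = []
    for i in range(len(signal) // window):
        chunk = signal[i * window:(i + 1) * window]
        spectrum = np.fft.rfft(chunk)           # 513 complex bins
        rows.append(np.abs(spectrum)[:n_features])
    return np.array(rows)

# One second of synthetic audio yields 7 full windows of 512 features.
samples = np.random.default_rng(1).normal(size=RATE)
features = fft_features(samples)
print(features.shape)  # (7, 512)
```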
Comparing the samples of unroasted and roasted frequency signatures revealed that, on average, there were differences that might be susceptible to a process that would recognize the onset of the crackling.
The following chart shows the mean frequency pattern for unroasted and roasted.
While there are a couple of areas that are fairly easily distinguished, note that over the entire frequency range the roasted signature is more unsettled.
Building the Model
The structure of a neural network is a series of input nodes connected to another layer, called a hidden layer, which is in turn connected to another hidden layer or an output layer. Typically, all the nodes in the lower layer are connected to all the nodes of the next layer. By connected, I mean that the value of a node is multiplied by a weight associated with a particular connection to a node in the next layer up. Once all of these values are calculated for a particular layer, each node is activated, which means that its value is pushed either lower or higher by an activation function. Once all the nodes are activated, those activation values are used as the inputs fed up to the next layer. This process of activation is important, because without it, the process is not really different from an ordinary linear regression. The activations enable non-linear effects.
Below is a diagram of a typical neural network. In the diagram the inputs enter on the left, the values are fed forward through the layers resulting in an output. In this case, the output would be either an indication of roasted or unroasted.
You might have noticed a detail at the top of the diagram: a node connected to the nodes of the next layer that does not appear to have any inputs of its own. These are called bias nodes. They always output a value of 1.0, but their impact is governed by the connection weight between the bias node and the next layer up.
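The feed-forward process described above, including the bias nodes, can be sketched like this. It is a toy network with made-up sizes, using a sigmoid activation as one common choice.

```python
import numpy as np

def sigmoid(z):
    """Activation: squashes each node's value into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def feed_forward(x, weights):
    """Propagate one input vector through the layers.  Each weight
    matrix includes a column for the bias node, which always
    contributes a constant 1.0."""
    a = x
    for W in weights:
        a = np.concatenate([[1.0], a])   # prepend bias node output
        a = sigmoid(W @ a)               # weighted sum, then activate
    return a

rng = np.random.default_rng(0)
# Toy network: 4 inputs -> 3 hidden nodes -> 1 output.
# Each matrix has one extra column for the bias node.
weights = [rng.normal(size=(3, 5)), rng.normal(size=(1, 4))]
out = feed_forward(rng.normal(size=4), weights)
print(out.shape)  # (1,)
```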
The diagram, while interesting to look at, is difficult to interpret if you want to understand more about the values that your particular network is using. For example, the network configuration that I used contains 512 inputs and 100 hidden nodes, resulting in more than 50,000 connection weights. Representing that on a chart becomes unwieldy. Farther down, I will show a better way to understand the network.
Processing the Model
Using the neural network, I then tried various combinations of hidden nodes and regularization (lambda), comparing the results against the validation sample. I used a batch method to get the advantages of vectorization, but it also limited my sample sizes, since Octave ran out of memory without more fussing with the programming.
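Trying those combinations amounts to a simple grid search. Here is a hypothetical sketch, where `train_and_score` stands in for the actual Octave training run and the illustrative scorer is made up.

```python
import itertools

def select_model(train_and_score, hidden_sizes, lambdas):
    """Try each (hidden nodes, lambda) combination and keep the one
    with the best validation accuracy."""
    best = None
    for h, lam in itertools.product(hidden_sizes, lambdas):
        acc = train_and_score(h, lam)
        if best is None or acc > best[0]:
            best = (acc, h, lam)
    return best

# Illustrative scorer that happens to favor 100 hidden nodes
# and lambda = 1.0; a real run would train and validate a network.
fake = lambda h, lam: 1.0 - abs(h - 100) / 200 - abs(lam - 1.0) / 10
print(select_model(fake, [25, 50, 100], [0.1, 1.0, 3.0]))  # (1.0, 100, 1.0)
```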
I ran into difficulties because the learning data was selected from only one time period for unroasted and one for roasted. It turned out to be badly overfit. I did not want to increase the sample size much because of the memory issues, and did not want to rewrite the software to accommodate smaller batches. Finally, I swapped out the unroasted learning data for small slices of features taken from a larger period, for better representation.
The following chart shows the cost function by iteration for the final model selected. A cost function represents the degree to which a model fails to predict the correct category. As the weights in the network model are adjusted to account for the inaccuracies, the model improves. So, you would expect to see the model improve until it reaches a point of diminishing returns.
The cost function dropped dramatically to close to zero after about 10 iterations.
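As a concrete example, a common cost for this kind of classifier is cross-entropy with L2 regularization (lambda); the exact form used in the original Octave code is an assumption on my part.

```python
import numpy as np

def cost(predictions, labels, weights, lam):
    """Cross-entropy cost plus L2 regularization over the weights.
    Bias-node weights (first column) are conventionally left
    unregularized."""
    m = len(labels)
    eps = 1e-12  # guard against log(0)
    ce = -np.mean(labels * np.log(predictions + eps)
                  + (1 - labels) * np.log(1 - predictions + eps))
    reg = lam / (2 * m) * sum(np.sum(W[:, 1:] ** 2) for W in weights)
    return ce + reg

labels = np.array([0.0, 1.0, 1.0, 0.0])
good = np.array([0.05, 0.95, 0.9, 0.1])   # near-correct predictions
bad = np.array([0.9, 0.1, 0.2, 0.8])      # mostly wrong predictions
W = [np.ones((3, 5))]
print(cost(good, labels, W, 0.0) < cost(bad, labels, W, 0.0))  # True
```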
Because all nodes in one layer are connected to all nodes of the next layer, the connections can also be represented as a matrix of connection weights. While the node values are ephemeral, changing with each set of data points that enters the network and culminates in outputs, the connection weights are of most interest. It is these connection weights that determine how much the network is influenced by particular combinations of input values.
The chart below shows an examination of the neural-net connection weights between the input layer and the hidden layer. The input values correspond to the values of the transforms. On the left are the hidden nodes used. The color scaling is a way of graphically showing the connection weights, with the spectrum centered on zero.
You can see the greatest weightings are associated with the greatest differences in the two frequency signatures. In fact, a lot of nodes seem like they could be removed without much loss, although I did not test this.
The next chart shows the connection weights from the hidden nodes to the output. These nodes are much less subtle, weighting the signal steeply in one direction or another depending upon the hidden node activations.
The percent accuracy after the iterations for the learning data was 100.0%. When the model was applied to the validation data it was 98.8%.
It was at that point that I finally ran the model against the test sample. Remember that the test sample is an amalgamation of a sample from the unroasted section of sound placed back-to-back with a section from the roasted section.
The result was also 98.8%, which seemed like too much of a coincidence, and I spent a fair amount of time looking for an error where I might have accidentally duplicated the validation and test data. However, I confirmed that they are two distinct data sets, albeit from close periods in time.
Another aspect of the problem is how to use the result in a practical fashion. For example, there can be bad data: extraneous, transitory noises that quickly disappear and do not indicate that crackling has occurred. However, once the crackling starts, you would expect a sustained set of signals that overwhelmingly indicate that the crackling is happening.
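One simple way to turn the raw per-window signals into a practical trigger is to require a run of consecutive positive frames before declaring that crackling has begun. This is a sketch of that idea rather than the method used here, and the run length `k` is an illustrative choice.

```python
import numpy as np

def sustained(raw, k=10):
    """Report crackling only after k consecutive positive frames,
    filtering out brief transitory noise."""
    run = 0
    out = []
    for frame in raw:
        run = run + 1 if frame else 0
        out.append(run >= k)
    return np.array(out)

# A lone spike is ignored; a sustained run eventually triggers.
raw = np.array([0] * 20 + [1] + [0] * 20 + [1] * 30, dtype=bool)
flags = sustained(raw, k=10)
print(flags[20], flags[-1])  # False True
```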
Below is a chart showing the test samples with the raw signals. There are a few times that it does not recognize that it is in crackling territory, but it appears to be sufficiently capable on the test sample.
Finally, the model was run against the entire recording to see how it would fare. Because there was so much data involved, it was not feasible to run in Octave. I instead used Python, loaded the network model with all the connection values into PyNeurGen (a Python library that I first wrote a few years ago), and ran it in an on-line fashion. The chart below shows the results. As you can see, the signal corresponds quite closely to the original markings on the spectrogram.
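An on-line run of this kind amounts to classifying the recording window by window as it streams in. The following sketch uses a dummy energy-based model in place of the trained network; the windowing and feature step mirror the description above, but the details are assumptions.

```python
import numpy as np

def stream_predict(signal, predict, window=1024):
    """Classify the recording window-by-window, mimicking an on-line
    run: each 0.128 s slice is transformed and fed to the model."""
    flags = []
    for i in range(len(signal) // window):
        chunk = signal[i * window:(i + 1) * window]
        features = np.abs(np.fft.rfft(chunk))[:512]
        flags.append(predict(features))
    return np.array(flags)

# Dummy model: call a frame "roasted" when its spectrum is energetic.
rng = np.random.default_rng(2)
quiet = rng.normal(scale=0.1, size=7 * 1024)   # stand-in: drum rumble
loud = rng.normal(scale=5.0, size=8 * 1024)    # stand-in: crackling
model = lambda f: f.sum() > 10_000.0
flags = stream_predict(np.concatenate([quiet, loud]), model)
print(flags[:7].any(), flags[7:].all())  # False True
```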
As an exercise, this has been an interesting project. To actually put the model to good use would involve being able to listen and to process the results in real-time.
For a commercial roaster, it might have some applicability.
As a later project, I intend to see if I can do an embedded version of this. That will be interesting on a number of fronts, from converting the software to the selection of hardware components that would be unfazed by the heat associated with being attached to a roaster, to the selection of controls on the device that would be simple to control, yet useful in a variety of roasting scenarios.