Previous Work: Automatic Recognition of Camera Speech (ARCS)
In this ICSI project, researchers are working to improve speech recognition on noisy, often distorted audio recorded by the body cameras of working police officers during traffic stops. The work is part of a larger project at Stanford to extract information from these data. The Stanford project focuses on analyzing the interactions between officers and the communities they serve, in the hope that the results could help transform the relationship between police and communities, produce solid data on officer-community interaction, and inform officer training programs. Preliminary analyses have used manual transcriptions, but this approach limits the utility of the results, given the substantial cost and time delay involved in generating the transcriptions. Furthermore, there are likely to be automatically computed measures, which humans cannot easily mark, that will be useful for categorizing properties of interest, such as politeness and other affective categories.
The technical aspects of the speech recognition component of this project are quite challenging. Because manual transcription is costly and slow, there is much less word-labeled training data than ICSI scientists would prefer to have. Additionally, the audio is often heavily corrupted by wind noise, which is not only additive but also commonly saturates the recording hardware, leading to clipped waveforms and significant distortion of the speech signal. The speech itself also differs in conversational style from the larger databases typically used to improve such models, including “conversational” data sets like Switchboard.
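To make the clipping problem concrete, the following is a minimal sketch of how saturated stretches of a waveform might be flagged. It is not the project's method: the function name `clipped_fraction`, its parameters, and the default thresholds are all illustrative assumptions. The idea is simply that a saturated recorder produces runs of consecutive samples pinned at or near full scale, whereas an isolated loud sample does not.

```python
import math


def clipped_fraction(samples, full_scale=1.0, tolerance=1e-3, min_run=3):
    """Fraction of samples lying in runs of >= min_run near-full-scale values.

    All names and defaults here are hypothetical, chosen for illustration.
    samples: sequence of floats, assumed normalized to [-full_scale, full_scale].
    """
    threshold = full_scale * (1.0 - tolerance)
    clipped = 0  # samples counted as clipped so far
    run = 0      # length of the current run of near-full-scale samples
    for x in samples:
        if abs(x) >= threshold:
            run += 1
        else:
            if run >= min_run:
                clipped += run
            run = 0
    if run >= min_run:  # account for a run that reaches the end of the signal
        clipped += run
    return clipped / len(samples) if samples else 0.0


# Usage: a sine wave driven at twice full scale and hard-limited,
# a crude stand-in for a saturated microphone input.
overdriven = [max(-1.0, min(1.0, 2.0 * math.sin(2 * math.pi * n / 100)))
              for n in range(1000)]
print(f"clipped fraction: {clipped_fraction(overdriven):.2f}")
```

A measure like this could, for example, be used to weight or exclude badly saturated segments during training, though whether that helps for this data is an open question.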
The PI expects that the work will require much more than simply training models and using them to generate word sequences for Stanford to analyze, although that will of course be necessary. Ultimately, the team will need to take advantage of Stanford’s analysis to focus on the words that matter most for classifying the affect of interest. This will be done by incorporating Stanford’s linguistic results into the analysis of the recognizer’s lattice output, rather than relying only on the 1-best word sequences.
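The difference between 1-best output and richer recognizer output can be illustrated with a small sketch. It assumes, for simplicity, that the lattice has been flattened into an n-best list of (word sequence, log score) pairs; the function name `keyword_posteriors` and the example hypotheses are hypothetical and not taken from the project. For each keyword of interest, the sketch sums the normalized probability of every hypothesis containing it, so a word that is missing from the top hypothesis can still receive a meaningful score.

```python
import math


def keyword_posteriors(nbest, keywords):
    """Approximate per-keyword posteriors from an n-best list.

    nbest: list of (words, log_score) pairs, where words is a list of strings.
    keywords: iterable of words whose presence matters for later analysis.
    Illustrative only; a real system would work on the lattice directly.
    """
    if not nbest:
        return {kw: 0.0 for kw in keywords}
    # Subtract the max log score before exponentiating, for numerical stability.
    max_log = max(score for _, score in nbest)
    weights = [math.exp(score - max_log) for _, score in nbest]
    total = sum(weights)
    posteriors = {}
    for kw in keywords:
        mass = sum(w for (words, _), w in zip(nbest, weights) if kw in words)
        posteriors[kw] = mass / total
    return posteriors


# Usage with made-up hypotheses and scores.
nbest = [
    (["police", "step", "out"], math.log(0.5)),
    (["please", "step", "out"], math.log(0.3)),
    (["please", "stay", "out"], math.log(0.2)),
]
print(keyword_posteriors(nbest, ["please", "step"]))
```

In this made-up example the top hypothesis does not contain "please" at all, yet the competing hypotheses together give that word substantial probability. This is the kind of information that is lost when only the 1-best word sequence is passed on for analysis.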