SBIR Phase I: Exploring the Feasibility of Deployable Crowd-Powered Real-Time Captioning Supplemented with Automatic Speech Recognition

Period of Performance: 01/01/2015 - 12/31/2015


Phase 1 SBIR

Recipient Firm

Legion Labs LLC
1401 Beechwood Blvd
Pittsburgh, PA 15217
Firm POC, Principal Investigator


This SBIR Phase I project will investigate the feasibility of a high-quality speech-to-text service that combines the input of multiple non-expert human workers with the output of automatic speech recognition. Real-time captioning converts speech to text quickly (in less than five seconds) and is a vital accommodation that allows deaf and hard of hearing students to participate in mainstream classrooms and other educational activities. The currently accepted approach to real-time captioning is to use expert human captionists (stenographers), who are very expensive ($150-300 per hour) and difficult to schedule. Computers can also convert speech to text via automatic speech recognition, but this technology is still unreliable in realistic settings and is likely to remain so in the near- and medium-term future. This award will advance a higher-quality and more affordable alternative system for real-time captioning that uses computation to coordinate multiple workers, who can be drawn from the existing labor force more readily than highly specialized typing experts. This project will increase access for deaf and hard of hearing people, resulting in greater opportunities to participate in science and engineering. This in turn may afford deaf and hard of hearing people greater employment opportunities.
The approach advanced by this project uses computation to combine the partial captions provided on demand by human workers, converting speech to text with very low latency (less than five seconds). Advances in human-computer interaction will allow each constituent worker to be directed, via both aural and visual cues, to type only part of what he or she hears, and will optimally adjust the playback rate of the audio to each worker's current typing speed. Novel algorithms based on multiple sequence alignment (often used in gene sequencing) will merge the resulting partial captions into a final output stream that can be forwarded back to the user.
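To make the merging step concrete, the following is a minimal sketch of how overlapping partial captions from several workers might be combined into one stream. It does not implement multiple sequence alignment; it substitutes a much simpler technique, timestamp-window clustering with a majority vote, and every name in it (`merge_partial_captions`, `window`, the `(time, word)` input format) is a hypothetical chosen for illustration, not part of the project's system.

```python
from collections import Counter
from typing import List, Tuple

# Assumption: each worker's partial caption is a list of (seconds, word) pairs,
# where the time is when the word was heard in the original audio.
Word = Tuple[float, str]


def _vote(cluster: List[Word]) -> str:
    """Return the most frequently typed word in a cluster of nearby words."""
    counts = Counter(word for _, word in cluster)
    return counts.most_common(1)[0][0]


def merge_partial_captions(streams: List[List[Word]], window: float = 0.2) -> List[str]:
    """Merge partial captions by grouping words whose timestamps fall within
    `window` seconds of the first word in the group, then keeping the
    majority spelling of each group.

    This is a simplified stand-in for alignment-based merging: it assumes
    timestamps are reliable and that distinct spoken words are more than
    `window` seconds apart.
    """
    # Pool every worker's words into one time-ordered list.
    pooled = sorted((t, w.lower()) for stream in streams for t, w in stream)

    merged: List[str] = []
    cluster: List[Word] = []
    for t, w in pooled:
        # Start a new cluster once we move past the current window.
        if cluster and t - cluster[0][0] > window:
            merged.append(_vote(cluster))
            cluster = []
        cluster.append((t, w))
    if cluster:
        merged.append(_vote(cluster))
    return merged


if __name__ == "__main__":
    worker_a = [(0.00, "the"), (0.40, "quick"), (0.80, "brown")]
    worker_b = [(0.85, "brown"), (1.20, "fox"), (1.60, "jumsp")]  # typo
    worker_c = [(0.45, "Quick"), (1.25, "fox"), (1.62, "jumps")]
    worker_d = [(1.65, "jumps"), (2.00, "over")]
    print(" ".join(merge_partial_captions([worker_a, worker_b, worker_c, worker_d])))
```

In this sketch no single worker types the whole sentence, yet the overlap between workers lets the majority vote discard one worker's typo. An alignment-based method would be needed where timestamps are noisy or workers disagree on word boundaries.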
The incorporation of automatic speech recognition will further reduce costs and increase the scalability of the approach. As the service is developed and automatic speech recognition improves, the service will rely less on humans and more on computation, providing a path toward full automation in the future. This award will investigate the appropriateness and feasibility of captioning systems based on this approach by deploying such a system in the field, measuring the quality of the captions generated, and collecting qualitative feedback from deaf and hard of hearing students in science and engineering fields.
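The abstract does not say how caption quality will be measured. As one illustration, the sketch below computes word error rate (WER), a metric commonly used for speech-to-text output. The choice of WER and the function name `word_error_rate` are assumptions made here, not details taken from the project.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between a reference transcript and a
    generated caption, divided by the number of reference words.

    Assumption: words are compared case-insensitively after splitting on
    whitespace; punctuation handling is left out of this sketch.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from caption
            insertion = curr[j - 1] + 1  # extra word in caption
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    truth = "the quick brown fox jumps over"
    print(word_error_rate(truth, "the quick fox jumps over"))
```

A lower value indicates a caption closer to the reference transcript. Latency and the qualitative feedback from students mentioned above would need to be assessed separately.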