Comparison of Different Segmentation Methods Used for Semi-Automatic Labeling

    Matti Karjalainen and Toomas Altosaar

    Helsinki University of Technology
    Laboratory of Acoustics and Audio Signal Processing
    P.O.Box 3000, FIN-02015 HUT, Finland


    Labeling of speech corpora is an integral part of spoken language research, and reliable, accurate automated labeling is a highly desirable yet very difficult goal. In this paper we describe several methods that we have developed and applied to this problem. One method uses seed databases: a given orthographic (textual) input is used to find the best-matching diphone contexts in a seed database, and feature-based matching and alignment are then applied to find the most probable phone boundaries. Other approaches use specialized neural networks that are trained on the same seed databases to learn segment boundaries implicitly. Different post-processing methods, based on rules and search strategies, are used to obtain the final forced labeling. The methods are compared with one another in terms of accuracy and computational cost, and with the typical performance of standard hidden Markov model based labeling systems.
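
    To illustrate the seed-database idea, the sketch below shows one way a known phone boundary in a seed diphone could be transferred to a new utterance by aligning feature frames. It assumes dynamic time warping (DTW) with a Euclidean frame distance as the alignment step; the abstract only states that a feature-based match/alignment is used, so the choice of DTW, the function names dtw_path and transfer_boundary, and the feature representation are all hypothetical and not taken from the paper.

```python
from math import inf, sqrt


def dtw_path(seed, target):
    """Align two feature sequences with DTW (an assumed alignment method).

    seed, target: lists of equal-length feature vectors, one per frame.
    Returns the warping path as a list of (seed_frame, target_frame) pairs.
    """
    n, m = len(seed), len(target)

    def dist(a, b):
        # Euclidean distance between two feature vectors.
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # cost[i][j] = best cumulative cost aligning seed[:i] with target[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seed[i - 1], target[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1],
                                 cost[i - 1][j],
                                 cost[i][j - 1])

    # Backtrack from the last frame pair to the first.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    path.reverse()
    return path


def transfer_boundary(path, seed_boundary):
    """Map a boundary frame in the seed to the first aligned target frame."""
    return next(t for s, t in path if s == seed_boundary)


# Toy example: one-dimensional features, boundary at seed frame 3.
seed = [[0.0], [0.0], [0.0], [1.0], [1.0], [1.0]]
target = [[0.0], [0.0], [0.0], [0.0], [0.0], [1.0], [1.0]]
boundary = transfer_boundary(dtw_path(seed, target), seed_boundary=3)
print(boundary)  # 5: the first target frame of the second phone
```

    In this toy case the first phone is longer in the target than in the seed, and the warping path stretches it so that the boundary lands on target frame 5. A real system would use multi-dimensional acoustic features and would still need the post-processing stage to arrive at the final forced labeling.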