Matthew Faytak
University at Buffalo
NTU invited workshop
Contours can be extracted from the ultrasound image using a combination of human-generated hints and automatic processing Iskarous (2005); Stone (2005)
Analysis of extracted contours is usually done with a spline model, which can handle non-linear patterns (like tongue shapes) Davidson (2006); Heyne et al. (2019)
figure from Weller et al. (to appear)
Various types of feature engineering
All get around feature extraction and the need for (most) human intervention by focusing on the image's pixels as a set of features
Each ultrasound image is composed of tens of thousands of pixels, each of which has a numerical value indicating brightness
Pixels form an arc (fan) shape in most ultrasound frames: the raw reflection data is recorded along radial scan lines and stored as a rectangular grid Wrench & Scobbie (2008); Eshky et al. (2021)
This grid is transformed to real-world (fan-shaped) proportions before we work with it; figure from Eshky et al. (2021)
Tongue position change means pixels change in brightness from frame to frame
The Euclidean distance between two frames, computed over all of their pixels, can be used as a measure of tongue movement; figure from Palo (2019)
Defining pixel difference (PD) more precisely (Palo 2019):
We can calculate the difference over successive frames (step size 1), or over frames more separated in time (step size L)
PD_k = ‖F_{k+L} − F_k‖ for k = 1, 2, …, n − L, where F_k is the vector of pixel brightness values in frame k and n is the total number of frames
The time associated with this measurement is the average of the times of the two frames involved, i.e. (1/2)(t_{F_k} + t_{F_{k+L}})
Palo (2019) mainly uses a step size of 1
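A minimal NumPy sketch of this calculation (not the SATKit implementation; frames and times are hypothetical arrays of ultrasound frames and their timestamps):

```python
import numpy as np

def pixel_difference(frames, times, step=1):
    """Pixel difference (PD) between frames separated by `step` frames.

    frames: array of shape (n, h, w) holding pixel brightness values
    times:  array of shape (n,) holding each frame's timestamp
    Returns one PD value and one associated time per frame pair.
    """
    n = frames.shape[0]
    flat = frames.reshape(n, -1).astype(float)  # each frame as a vector of w*h pixels
    diffs = flat[step:] - flat[:-step]          # F_{k+L} - F_k for k = 1 .. n - L
    pd = np.linalg.norm(diffs, axis=1)          # Euclidean distance per frame pair
    t = 0.5 * (times[step:] + times[:-step])    # average time of the two frames
    return pd, t

# toy usage on random "frames" (placeholder data and frame rate)
frames = np.random.rand(100, 64, 128)
times = np.arange(100) / 80.0
pd, t = pixel_difference(frames, times, step=1)
```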
Movement detected includes that of the intrinsic tongue muscles, unlike the other measures discussed so far
Various psycholinguistic applications McMillan & Corley (2010); Palo (2019)
Related but more computationally complex: optical flow, which detects apparent motion between two frames Horn & Schunck (1981)
Integrating the velocity signal gives the amount of displacement for rigid bodies (e.g. the larynx) Moisik et al. (2014); figure from Faytak, Moisik & Palo (2021)
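As a rough illustration (not the specific method of Moisik et al. 2014), OpenCV's dense Farnebäck optical flow estimates a per-pixel motion vector between two frames; the file names, parameter values, and region of interest below are placeholders:

```python
import cv2
import numpy as np

# two ultrasound frames as 8-bit grayscale images (placeholder file names)
prev_frame = cv2.imread("frame_000.png", cv2.IMREAD_GRAYSCALE)
next_frame = cv2.imread("frame_001.png", cv2.IMREAD_GRAYSCALE)

# dense Farneback optical flow: one (dx, dy) vector per pixel
# positional args after None: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags
flow = cv2.calcOpticalFlowFarneback(prev_frame, next_frame, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

# mean vertical velocity within a placeholder region of interest;
# summing this signal across frame pairs approximates displacement of a rigid body
roi_dy = flow[100:200, 50:150, 1]
mean_dy = float(np.mean(roi_dy))
```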
Dimensionality reduction carried out not on ultrasound contours, but on entire ultrasound frames
Brightness values can be used as raw data for dimensionality reduction
Each ultrasound frame can be thought of as a matrix of width w by height h
Each pixel at location x, y across data sets with the same frame size w × h can be thought of as a separate feature
Challenge for feature selection: each frame contributes tens of thousands of pixel features, far too many to sort through by hand
PCA outputs eigenvectors and eigenvalues
We might recall this was applied to numerous acoustic measures in our first notebook
Eigenvector loadings:
Data in PC space:
Method can be extended to image data: pixels in an image of shape w × h are treated as a very long list of features of length wh
Like in the notebook’s dataframe:
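A minimal scikit-learn sketch of this step, using a hypothetical array frames of shape (n, h, w):

```python
import numpy as np
from sklearn.decomposition import PCA

# hypothetical stack of n ultrasound frames, each h x w pixels
frames = np.random.rand(500, 64, 128)
n, h, w = frames.shape

# flatten: each frame becomes one row of w*h pixel features
X = frames.reshape(n, h * w)

pca = PCA(n_components=10)
scores = pca.fit_transform(X)            # PC scores, shape (n, 10)
print(pca.explained_variance_ratio_)     # variance explained by each PC
```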
Convert our length-wh eigenvectors back to w × h grids to see the impact of the associated PCs on pixel brightness in physical space Sirovich & Kirby (1987)
We could reshape our notebook’s eigenvectors, but because feature number doesn’t correspond to physical space, it doesn’t really gain us anything
When the features correspond to physical position (i.e. of pixels), eigenpictures tell us more: figures from Turk & Pentland (1991)
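Continuing the sketch above, each row of pca.components_ is a length-wh eigenvector that can be reshaped to h × w and plotted as an image (matplotlib assumed; data is again a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

frames = np.random.rand(500, 64, 128)     # hypothetical frame stack
n, h, w = frames.shape
pca = PCA(n_components=10).fit(frames.reshape(n, h * w))

# each row of components_ is a length w*h eigenvector; reshape it to h x w
eigentongue_1 = pca.components_[0].reshape(h, w)
plt.imshow(eigentongue_1, cmap="RdBu")    # PC 1 loadings as an image in pixel space
plt.colorbar()
plt.show()
```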
Eigenfaces show variation in shape and size of facial features, hair, and setting (lighting, pose, etc)
Hueber et al. (2007) coined the term eigentongues, by analogy with eigenfaces
Another example, from all tongue postures in a corpus, at a lower spatial resolution figures from Hueber et al (2007)
We can now characterize our ultrasound image data set in terms of the PC scores for each eigentongue
On a practical level, this avoids the feature selection problem by constructing new features, and avoids the time-consuming process of contour extraction
We can also do some neat tricks with eigentongues which aren’t easy with other approaches
Sequences of frames can be fed to PCA instead of frames at a single point of interest (e.g. the midpoint); this yields time series data Mielke et al. (2017); Hoole & Pouplier (2017); Smith et al. (2019)
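For instance, a PCA fitted on all frames in a corpus can project each token's frame sequence into PC space, giving one trajectory per PC (a sketch under the same hypothetical setup as above):

```python
import numpy as np
from sklearn.decomposition import PCA

all_frames = np.random.rand(2000, 64, 128)   # hypothetical: every frame in the corpus
n, h, w = all_frames.shape
pca = PCA(n_components=5).fit(all_frames.reshape(n, h * w))

# one token's frame sequence, e.g. spanning a single vowel (placeholder data)
token = np.random.rand(30, 64, 128)
trajectory = pca.transform(token.reshape(len(token), h * w))  # shape (30, 5): PC scores over time
```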
Eigentongue PC scores can themselves be submitted to further dimensionality-reduction techniques; see Carignan (2019)
Weighted combinations of eigentongues can reconstruct observations Hoole & Pouplier (2017); Faytak et al. (2020)
Specifically, an image Γ can be reconstructed as a linear combination of the first M eigentongues Berry (2012): Γ̂ = Σ_{m=1}^{M} ω_m u_m
where u_m is the mth eigentongue and ω_m is the projection of Γ onto the mth eigentongue (i.e. its PC score)
Creates a denoised version of the observation figures from Faytak et al. (2020)
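A sketch of this reconstruction with scikit-learn (whose PCA centers the data, so the mean image is added back in); frames is again a hypothetical frame stack:

```python
import numpy as np
from sklearn.decomposition import PCA

frames = np.random.rand(500, 64, 128)       # hypothetical frame stack
n, h, w = frames.shape
X = frames.reshape(n, h * w)

M = 10                                       # number of eigentongues retained
pca = PCA(n_components=M).fit(X)

omega = pca.transform(X[:1])                 # PC scores (projections) of one image
# reconstruction: mean image + sum_m omega_m * u_m
gamma_hat = pca.mean_ + omega @ pca.components_
# equivalently: gamma_hat = pca.inverse_transform(omega)
denoised = gamma_hat.reshape(h, w)           # back to image shape for plotting
```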
Very efficient once the basics are mastered
Pixel methods easy to use on other data types
Fairly different from some approaches to analysis
Eigentongue analysis only works properly within single speakers
Our final lecture will cover a Python implementation of eigentongue methods from my recent work
If you are curious about how to implement pixel difference or optical flow in Python, see the Speech Articulation Toolkit (SATKit): Faytak, Moisik & Palo (2021); Palo, Moisik & Faytak (2022)
Berry, J. (2012). Machine learning methods for articulatory data. Doctoral dissertation, University of Arizona. PDF
Bregler, C. & Konig, Y. (1994). “Eigenlips” for robust speech recognition. In Proceedings of ICASSP ’94 Vol. 2. DOI
Danilouchkine, M., Mastik, F. & van der Steen, A. (2009). A study of coronary artery rotational motion with dense scale-space optical flow in intravascular ultrasound. Physics in Medicine and Biology, 54(6), 1397–1418. DOI
Carignan, C. (2019). TRACTUS (Temporally Resolved Articulatory Configuration Tracking of Ultrasound). Software. GitHub
Davidson, L. (2006). Comparing tongue shapes from ultrasound imaging using smoothing spline analysis of variance. The Journal of the Acoustical Society of America, 120, 407–415. DOI
Eshky, A., Cleland, J., Ribeiro, M., Sugden, E., Richmond, K. & Renals, S. (2021). Automatic audiovisual synchronisation for ultrasound tongue imaging. Speech Communication, 132, 83-95. DOI
Faytak, M., Moisik, S. & Palo, P. (2021). The Speech Articulation Toolkit (SATKit): Ultrasound image analysis in Python. In Proceedings of ISSP 12, 234-237. PDF
Faytak, M., Liu, S. & Sundara, M. (2020). Nasal coda neutralization in Shanghai Mandarin: Articulatory and perceptual evidence. Laboratory Phonology, 11(1), 23. DOI
Heyne, M., Derrick, D., & Al-Tamimi, J. (2019). Native language influence on brass instrument performance: An application of generalized additive mixed models (GAMMs) to midsagittal ultrasound images of the tongue. Frontiers in Psychology, 2597. DOI
Hoole, P. & Pouplier, M. (2017). Öhman returns: New horizons in the collection and analysis of imaging data in speech production research. Computer Speech & Language, 45, 253-277. DOI
Horn, B., & Schunck, B. (1981). Determining optical flow. Artificial Intelligence, 17(1), 185–203. DOI
Hueber, T., Aversano, G., Chollet, G., Denby, B., Dreyfus, G., Oussar, Y., Roussel, P. & Stone, M. (2007). Eigentongue feature extraction for an ultrasound-based silent speech interface. In Proceedings of ICASSP ’07 Vol. 1. DOI
Iskarous, K. (2005). Detecting the edge of the tongue: A tutorial. Clinical Linguistics & Phonetics, 19(6-7), 555-565. DOI
Krause, P., Kay, C. & Kawamoto, A., (2020) Automatic motion tracking of lips using digital video and OpenFace 2.0, Laboratory Phonology 11(1), 9. DOI
McMillan, C. & Corley, M. (2010). Cascading influences on the production of speech: Evidence from articulation. Cognition, 117(3), 243–260. DOI
Mielke, J., Carignan, C. & Thomas, E. (2017). The articulatory dynamics of pre-velar and pre-nasal /æ/-raising in English: An ultrasound study. The Journal of the Acoustical Society of America, 142(1), 332-349. DOI
Moisik, S., Lin, H., & Esling, J. (2014). A study of laryngeal gestures in Mandarin citation tones using simultaneous laryngoscopy and laryngeal ultrasound (SLLUS). JIPA, 44(1), 21–58. DOI
Oh, M., & Lee, Y. (2018). ACT: An Automatic Centroid Tracking tool for analyzing vocal tract actions in real-time magnetic resonance imaging speech production data. The Journal of the Acoustical Society of America, 144(4), EL290-EL296. DOI
Palo, P. (2019). Measuring pre-speech articulation. Doctoral dissertation, Queen Margaret University. PDF
Palo, P., Moisik, S. & Faytak, M. (2022). Speech Articulation Toolkit (SATKit). Software. GitHub
Smith, B., Mielke, J., Magloughlin, L. & Wilbanks, E. (2019) Sound change and coarticulatory variability involving English /ɹ/. Glossa: A Journal of General Linguistics 4(1), 63. DOI
Stone, M. (2005). A guide to analysing tongue motion from ultrasound images. Clinical Linguistics & Phonetics, 19(6-7), 455-501. DOI
Strycharczuk, P. & Scobbie, J. (2017). Whence the fuzziness? Morphological effects in interacting sound changes in Southern British English. Laboratory Phonology 8(1), 7. DOI
Turk, M. & Pentland, A. (1991). Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1), 71-86. DOI
Weller, J., Faytak, M., Steffman, J., Mayer, C., Teixeira, G. & Tankou, R. (to appear). Supralaryngeal articulation across voicing and aspiration in Yemba vowels. In Proceedings of ACAL 51/52.
Wrench, A., & Scobbie, J. (2008). High-speed cineloop ultrasound vs. video ultrasound tongue imaging: Comparison of front and back lingual gesture location and relative timing. In Proceedings of ISSP 8. PDF