Matthew Faytak
University at Buffalo
NTU invited workshop
Contours can be extracted from the ultrasound image using a combination of human-generated hints and automatic processing Iskarous (2005); Stone (2005)
Usually done on contours with a spline model which can handle non-linear patterns (like tongue shapes) Davidson (2006); Heyne et al (2019)
figure from Weller et al. (to appear)
Various types of feature engineering
All get around feature extraction, and (most of) the need for human intervention, by treating the image’s pixels themselves as a set of features
Each ultrasound image is composed of tens of thousands of pixels, each of which has a numerical value indicating brightness
The pixels are arc-shaped in most ultrasound frames as displayed, because the raw reflection data is stored as a rectangular grid and then fanned out for display Wrench & Scobbie (2008); Eshky et al (2021)
This grid is transformed to real-world proportions before we work with it figure from Eshky et al (2021)
Tongue position change means pixels change in brightness from frame to frame
The Euclidean distance between two frames, computed over all their pixels, can be used as a measure of tongue movement figure from Palo (2019)
Defining pixel difference (PD) more precisely Palo (2019):
We can calculate the difference over successive frames (step size 1), or over frames more separated in time (step size L)
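One way to write this out, consistent with the Euclidean-distance description above. The notation is assumed here rather than quoted from Palo (2019): F_k(x, y) is the brightness of the pixel at (x, y) in the kth frame, and the sums run over every pixel in the frame.

```latex
\mathrm{PD}_L(k) = \lVert F_{k+L} - F_k \rVert
  = \sqrt{\sum_{x}\sum_{y} \bigl(F_{k+L}(x, y) - F_k(x, y)\bigr)^2}
```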
for frames F_k, k = 1, 2, …, n − L
The time associated with each measurement is the average of the times of the two frames involved, or ½(t_{F_k} + t_{F_{k+L}})
Palo (2019) mainly uses a step size of 1
Movement detected includes that of the intrinsic tongue muscles, unlike other measures discussed so far
Various psycholinguistic applications McMillan & Corley (2010); Palo (2019)
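A minimal sketch of the pixel difference described above, assuming the frames are already loaded as a NumPy array of shape (n_frames, height, width) with one timestamp per frame. The function name `pixel_difference` and the toy data are illustrative only, not taken from Palo (2019).

```python
import numpy as np


def pixel_difference(frames, times, step=1):
    """Euclidean distance between frames that are `step` frames apart.

    frames: array of shape (n_frames, height, width), pixel brightness
    times:  array of shape (n_frames,), acquisition time of each frame
    Returns (pd_times, pd_values), each of length n_frames - step.
    """
    # Cast to float first: subtracting uint8 images would wrap around.
    frames = np.asarray(frames, dtype=float)
    times = np.asarray(times, dtype=float)
    if not 1 <= step < len(frames):
        raise ValueError("step must be between 1 and n_frames - 1")
    # Pair frame k with frame k + step for every valid k.
    diff = frames[step:] - frames[:-step]
    pd_values = np.sqrt((diff ** 2).sum(axis=(1, 2)))
    # Each value is timestamped at the midpoint of its two frames.
    pd_times = 0.5 * (times[:-step] + times[step:])
    return pd_times, pd_values


# Toy data (hypothetical): three 2 x 2 frames sampled every 10 ms.
frames = np.array([
    [[0, 0], [0, 0]],
    [[3, 0], [0, 4]],
    [[3, 0], [0, 4]],
], dtype=np.uint8)
times = np.array([0.00, 0.01, 0.02])
pd_times, pd_values = pixel_difference(frames, times, step=1)
```

In the toy data the first pair of frames differs and the second pair is identical, so PD is nonzero and then zero: a stationary tongue gives a flat PD curve.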
Related but more computationally complex: optical flow, which detects apparent motion between two frames Horn & Schunck (1981)
Integrating the velocity signal gives the amount of displacement for rigid bodies (e.g. the larynx) Moisik et al (2014); figure from Faytak, Moisik & Palo (2021)
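A minimal sketch of the integration step only, assuming a per-frame velocity signal has already been obtained from optical flow. The trapezoidal rule and the function name `displacement_from_velocity` are choices made here for illustration, not necessarily what Moisik et al (2014) used.

```python
import numpy as np


def displacement_from_velocity(velocity, times):
    """Cumulative displacement from a sampled velocity signal.

    velocity: array of shape (n_frames,), e.g. vertical velocity in mm/s
    times:    array of shape (n_frames,), time of each sample in seconds
    Returns displacement at each sample, starting from 0.
    """
    velocity = np.asarray(velocity, dtype=float)
    times = np.asarray(times, dtype=float)
    dt = np.diff(times)
    # Trapezoidal rule: average velocity over each interval times its length.
    increments = 0.5 * (velocity[1:] + velocity[:-1]) * dt
    return np.concatenate([[0.0], np.cumsum(increments)])


# Toy signal (hypothetical): speed up, hold, slow down.
velocity = np.array([0.0, 2.0, 2.0, 0.0])
times = np.array([0.0, 0.5, 1.0, 1.5])
displacement = displacement_from_velocity(velocity, times)
```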
Dimensionality reduction carried out not on ultrasound contours, but on entire ultrasound frames
Brightness values can be used as raw data for dimensionality reduction
Each ultrasound frame can be thought of as a matrix of width w by height h
Each pixel at location x, y across data sets with the same frame size w × h can be thought of as a separate feature
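A small sketch of what "each pixel is a feature" means in practice, assuming frames stored as a NumPy array. Note that NumPy indexes rows first, so a frame of width w and height h is an array of shape (h, w). The random data here is a stand-in for real frames.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, h, w = 5, 4, 6

# Stand-in for real ultrasound frames: random brightness values 0-255.
frames = rng.integers(0, 256, size=(n_frames, h, w))

# One row per frame, one column per pixel: w * h features in total.
X = frames.reshape(n_frames, h * w)

# The pixel at row y, column x always lands in feature column y * w + x,
# so the same column means the same physical location in every frame.
y, x = 2, 3
feature_column = X[:, y * w + x]
```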
Challenge for feature selection:
PCA outputs eigenvectors and eigenvalues
We might recall this was applied to numerous acoustic measures in our first notebook
Eigenvector loadings:
Data in PC space:
Method can be extended to image data: pixels in an image of shape w × h are treated as a very long list of features of length wh
Like in the notebook’s dataframe:
Convert our length wh eigenvectors back to w × h grids to get impact of associated PCs on pixel brightness in physical space Sirovich & Kirby (1987)
We could reshape our notebook’s eigenvectors, but because feature number doesn’t correspond to physical space, it doesn’t really gain us anything
When the features correspond to physical position (i.e. of pixels), eigenpictures tell us more: figures from Turk & Pentland (1991)
Eigenfaces show variation in shape and size of facial features, hair, and setting (lighting, pose, etc)
Hueber et al (2007) coined the term eigentongues, by analogy with eigenfaces
Another example, from all tongue postures in a corpus, at a lower spatial resolution figures from Hueber et al (2007)
We can now characterize our ultrasound image data set in terms of the PC scores for each eigentongue
On a practical level, this avoids feature selection problems by constructing new features (the PCs), and avoids the time-consuming process of contour extraction
We can also do some neat tricks with eigentongues which aren’t easy with other approaches
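Before turning to those, the basic pipeline described so far (flatten each frame, run PCA, reshape the eigenvectors into eigentongues, read off the PC scores) can be sketched as follows. This is a minimal illustration using NumPy's SVD on random stand-in data; the function name `eigentongues` is hypothetical and this is not the implementation covered in the final lecture.

```python
import numpy as np


def eigentongues(frames, n_components):
    """PCA over whole frames.

    frames: array of shape (n_frames, h, w)
    Returns (tongues, eigenvalues, scores, mean_frame):
      tongues     (n_components, h, w)  eigenvectors reshaped to image grids
      eigenvalues (n_components,)       variance along each PC
      scores      (n_frames, n_components)  PC scores for each frame
      mean_frame  (h, w)                average frame removed before PCA
    """
    frames = np.asarray(frames, dtype=float)
    n_frames, h, w = frames.shape
    X = frames.reshape(n_frames, h * w)      # pixels as features
    mean_frame = X.mean(axis=0)
    Xc = X - mean_frame                      # centre each pixel
    # Rows of Vt are the eigenvectors of the pixel covariance matrix,
    # ordered from largest to smallest eigenvalue.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]
    eigenvalues = S[:n_components] ** 2 / (n_frames - 1)
    scores = Xc @ components.T
    return (components.reshape(n_components, h, w), eigenvalues,
            scores, mean_frame.reshape(h, w))


# Stand-in data: 20 random 8 x 10 "frames".
rng = np.random.default_rng(0)
frames = rng.random((20, 8, 10))
tongues, eigenvalues, scores, mean_frame = eigentongues(frames, n_components=3)
```

Each row of `scores` is the low-dimensional description of one frame; each `tongues[i]` can be displayed as an image to see which regions of the frame that PC brightens or darkens.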
Sequences of frames can be fed to PCA instead of frames at a single point of interest (e.g. midpoint); yields time series data Mielke et al (2017); Hoole & Pouplier (2017); Smith et al (2019)
Eigentongue PC scores can themselves be submitted to another dimensionality-reduction technique; see Carignan (2019)
Weighted combinations of eigentongues can reconstruct observations Hoole & Pouplier (2017); Faytak et al. (2020)
Specifically, an image Γ can be reconstructed as a linear combination of M eigentongues: Berry (2012)
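One form consistent with the definitions that follow, with M the number of eigentongues retained. This assumes Γ has already had the mean image subtracted and that the eigentongues are orthonormal; the exact notation in Berry (2012) may differ.

```latex
\Gamma \approx \sum_{m=1}^{M} \omega_m u_m,
\qquad \omega_m = u_m^{\top} \Gamma
```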
where u_m is the mth eigentongue and ω_m is the projection of Γ onto the mth eigentongue (i.e. its PC score)
Creates a denoised version of the observation figures from Faytak et al. (2020)
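A minimal sketch of reconstruction from a subset of eigentongues, on synthetic data built from two spatial patterns plus noise. The helper `reconstruct`, the data, and the explicit mean-centring step are assumptions made for illustration, not the procedure in Faytak et al. (2020).

```python
import numpy as np

rng = np.random.default_rng(1)
h, w, n_frames = 6, 8, 40

# Synthetic flattened "frames": mixtures of two patterns, plus pixel noise.
patterns = rng.random((2, h * w))
weights = rng.normal(size=(n_frames, 2))
noisy = weights @ patterns + 0.05 * rng.normal(size=(n_frames, h * w))

# PCA via SVD of the centred data; rows of Vt are the eigentongues.
mean_frame = noisy.mean(axis=0)
_, _, Vt = np.linalg.svd(noisy - mean_frame, full_matrices=False)


def reconstruct(frame, eigvecs, mean_frame, m):
    """Rebuild a flattened frame from its first m eigentongues."""
    u = eigvecs[:m]                      # (m, h * w)
    omega = u @ (frame - mean_frame)     # PC scores of this frame
    return mean_frame + omega @ u        # weighted sum of eigentongues


full = reconstruct(noisy[0], Vt, mean_frame, m=len(Vt))   # all PCs
denoised = reconstruct(noisy[0], Vt, mean_frame, m=2)     # first 2 PCs only
denoised_image = denoised.reshape(h, w)                   # back to a grid
```

Using every PC returns the original frame; keeping only the first few discards the low-variance directions, which in this toy case are mostly noise.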
Very efficient once the basics are mastered
Pixel methods easy to use on other data types
Fairly different from some approaches to analysis
Eigentongue analysis only works properly within single speakers
Our final lecture will cover a Python implementation of eigentongue methods from my recent work
If you are curious about how to implement pixel difference or optical flow in Python:
