Diary of Fahmi Abdulhamid
January 2013: Fleshing out my Thesis
I spent January writing the remainder of my thesis and revising existing content. Of note, I have written the Background and Feature Extraction chapters. Stuart has continued to give me feedback, which I have readily incorporated into my thesis. By the end of January, every chapter of my thesis except the Conclusion had reached what I consider a first draft. I have also begun setting up my final experiment. I plan to record myself delivering the same lecture extract multiple times, with different words and phrases each time. I aim to measure how robust my segmentation algorithm is to changes in sentence structure. I am currently setting up CMUSphinx 4 (an automatic speech recognition library written in Java) to time-align my speech recordings to a text transcript (I will plan what I will say beforehand). The challenge here is first to record clean and natural speech, and second to configure CMUSphinx 4 to transcribe what I say with reasonable accuracy (ideally > 80% correct).
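The 80% target can be checked by comparing the planned transcript with the recogniser output word by word. The following is a minimal sketch, assuming "correct" means 1 minus the word error rate (the diary does not say which measure is used). `edit_distance` and `word_accuracy` are hypothetical helpers written for illustration and are not part of CMUSphinx.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # word missing from hypothesis
                            curr[j - 1] + 1,      # extra word in hypothesis
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_accuracy(reference, hypothesis):
    """Return 1 - WER. Assumed definition; can be negative with many insertions."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return 1.0 - edit_distance(ref, hyp) / len(ref)
```

With this definition, a six-word reference with one substituted word scores 5/6, which would just clear the 80% target.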
December 2012: Enhancing my Visualisation
Like last month, I spent time working on my thesis. I have also made some enhancements to my visualisation to resolve some of the difficulties participants encountered in my user study. The context sentence popups that appear when hovering over Word Cloud words now display more content. Clicking the background of the visualisation no longer clears highlighted words. Instead, clicking on the background moves the audio to the point clicked.
November 2012: The Thesis begins
I spent November primarily writing my thesis. Most notably, I revised my thesis structure somewhat, fleshed out my Results chapter and completed my Design chapter. I have shown Stuart some of my work, and he has provided some helpful feedback.
October 2012: User Study Analysis
I spent my time analysing the results of my user study. At my disposal, I have recorded participant actions, participant comments, NASA Task Load Index (NASA-TLX) results, and the notes I took while observing participants complete my user study. I wrote several R and Python scripts to analyse my results. Comparing NASA-TLX results and usage strategies between the highest and lowest scoring participants proved to be quite insightful. I also began to write my results into my thesis.
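A comparison of this kind might look like the sketch below. It assumes the unweighted "Raw TLX" variant (the mean of the six subscale ratings), since the diary does not say whether the weighted or unweighted score was used. The record layout (`score`, `tlx`) and the function names are hypothetical, and this is not the original analysis script.

```python
from statistics import mean

# The six standard NASA-TLX subscales.
SUBSCALES = ("mental", "physical", "temporal",
             "performance", "effort", "frustration")


def raw_tlx(ratings):
    """Unweighted mean of the six subscale ratings (Raw TLX, assumed variant)."""
    missing = [s for s in SUBSCALES if s not in ratings]
    if missing:
        raise ValueError(f"missing subscales: {missing}")
    return mean(ratings[s] for s in SUBSCALES)


def compare_groups(participants, n):
    """Mean Raw TLX of the n highest- and n lowest-scoring participants.

    Each participant is a dict with a task 'score' and a 'tlx' ratings dict
    (hypothetical layout).
    """
    ranked = sorted(participants, key=lambda p: p["score"], reverse=True)
    top, bottom = ranked[:n], ranked[-n:]
    return (mean(raw_tlx(p["tlx"]) for p in top),
            mean(raw_tlx(p["tlx"]) for p in bottom))
```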
August - September 2012: User testing
In these two months, I polished my visualisation to make it more aesthetically pleasing and started to write my thesis. I mostly developed the structure and wrote the introduction and user study chapters of my thesis. I later obtained approval from the Human Ethics Committee and booked the usability lab for my user study. In total, I had six participants take part in my pilot study and twenty (different) participants take part in the actual user study. The pilot study resolved most of the issues with the user study but, due to a last-minute addition of a new task which partially gave away the answer, not all participants completed the corrected task :(. Anyway, I've now begun to evaluate my results. I aim to differentiate participants who performed well from those who didn't, to identify differences in strategy when using the visualisation to find information.
July 2012: Getting ready for testing
I focused on user testing this month. After some research into user testing software, I developed a couple of user test plans and revised them with Stuart. The final test plan will record users as they complete a series of tasks on a small range of audio recordings, with a practice run beforehand. Each task will have the participant extract some information from an audio recording. Each participant will fill out a questionnaire at the end of the test. I did a lot of work to extend my visualisation to support the tasks I needed for my user study. My visualisation now records user events, is free of just about every bug I know of, and has a web.py server back-end to guide participants through the user study. I also looked at TAFE, my feature extraction back-end, to measure how its results are affected by varying the number of segments it produces. I averaged the results from three different audio recordings. It seems that the results are quite stable below 20 segments and change markedly at 20 segments. I assume the instability is due to the small segment sizes, which make it difficult to find topics that occur over a long stretch of time. The small segments may instead be identifying short and insignificant topics. I must investigate further to find out what is really going on.
- Word clouds display frequent words where word size and opacity correspond to frequency. Unlike significant phrases, frequent words display a greater breadth of detail about a segment and do not have to make sense when read as a sentence. Word clouds also promote skimming for efficiency.
- Hovering over a word in a word cloud displays the word in a sentence to provide context to the word. Context is not visible in a standard word cloud.
- Clicking on a word highlights where the stemmed word appears in the transcript. The word is given a colour so multiple words can be selected and displayed at the same time. Highlighting words in the transcript gives the user a finer level of detail about the subject(s) he or she is interested in.
- Audio controls at the bottom allow for a finer level of movement in the audio compared to the previous interface. The audio controls display segment and row boundaries, which help to reliably compare segment duration and give cues to the user about a shift in visual context. Parts the user has listened to are highlighted to help with memory and navigation, which is useful in a random-access audio interface.
- Controls to the right (not visible in the screenshot) enable highlighting of segments based on loudness and pitch if desired.
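The frequency-to-size mapping in the word cloud bullet above can be sketched as follows. This is only an illustration: it assumes a linear mapping from word count to font size and opacity, and it skips stemming and stop-word removal. The function name, parameters and default ranges are all hypothetical, and the actual visualisation may scale words differently.

```python
import re
from collections import Counter


def word_cloud_styles(text, top_n=20, min_size=10.0, max_size=40.0,
                      min_opacity=0.3, max_opacity=1.0):
    """Map the top_n most frequent words to a font size and an opacity."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    counts = Counter(words).most_common(top_n)
    if not counts:
        return {}
    highest = counts[0][1]
    lowest = counts[-1][1]
    span = highest - lowest
    styles = {}
    for word, n in counts:
        # t runs from 0 (least frequent shown word) to 1 (most frequent).
        t = (n - lowest) / span if span else 1.0
        styles[word] = {
            "count": n,
            "size": min_size + t * (max_size - min_size),
            "opacity": min_opacity + t * (max_opacity - min_opacity),
        }
    return styles
```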
May 2012: It begins
I had developed several prototypes and experiments before, but never as many as I did this month. With my task taxonomy in mind, I focused on which features should be used to help users navigate speech audio. I settled on six features, chosen because they should be simple for users to understand but still expressive enough to derive higher-level features.
Features extracted for speech audio visualisation:
- Audio features:
- Loudness (RMS energy)
- Pitch (Fundamental Frequency or f0)
- Text features:
- Time-stamped transcript (SubRip format)
- Speech rate (words per second)
- Sentence duration (in seconds)
- Segments (automatically identified by system)
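Three of these features are simple enough to sketch directly: loudness as RMS energy, speech rate, and sentence duration. The sketch below assumes non-overlapping frames for RMS and treats each SubRip cue as one sentence. Both are assumptions made for illustration, and all function names are hypothetical. It is not the Marsyas or TAFE code.

```python
import math
import re

# SubRip timestamps look like 00:01:02,500 (hours:minutes:seconds,milliseconds).
TIMESTAMP = re.compile(r"(\d+):(\d{2}):(\d{2}),(\d{3})")


def srt_seconds(stamp):
    """Convert a SubRip timestamp to seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not a SubRip timestamp: {stamp!r}")
    hours, minutes, seconds, millis = map(int, match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000.0


def cue_features(start, end, text):
    """Duration (seconds) and speech rate (words per second) of one cue."""
    duration = srt_seconds(end) - srt_seconds(start)
    if duration <= 0:
        raise ValueError("cue must end after it starts")
    return {"duration": duration,
            "speech_rate": len(text.split()) / duration}


def rms_energy(samples, frame_size):
    """RMS energy of each complete, non-overlapping frame of samples."""
    frames = []
    for i in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[i:i + frame_size]
        frames.append(math.sqrt(sum(x * x for x in frame) / frame_size))
    return frames
```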
The features were experimentally extracted using various systems: Marsyas (C++) for loudness and pitch, CMU Sphinx (Java) for transcript, Matlab for segment identification, and LingPipe (Java) for segment descriptors. My Audio Lights prototype was developed to integrate all my features into a simple tool to determine the feasibility of my features for visualisation.
Happy with the results, I developed several concept sketches for possible user interfaces, bearing in mind how the above features can be mapped to visual elements and my task taxonomy. My supervisor and I decided instead to augment a classical visualisation first, as creating a new visualisation is difficult.
I decided to implement a Strip Treemap. A Strip Treemap is a hierarchical Treemap layout algorithm that preserves the order of the data. But to implement this, I first spent a week developing a feature extraction system that takes an audio file and its accompanying transcript as input and produces all the features listed above as output. I paid attention to good system design, and ported my Marsyas and Matlab code to Java. Feature extraction algorithms were unit tested to help ensure correctness. I called the system "Transcript and Audio Feature Extractor", or "TAFE" for short.
With TAFE complete, I implemented a simple Strip Treemap with the extracted features! I will post my results after more experimentation.
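For readers unfamiliar with the layout, here is a minimal single-level sketch of the strip idea: items are placed left to right in horizontal strips, and a new strip is started whenever adding the next item would worsen the strip's average aspect ratio. It follows the commonly described Strip Treemap procedure, leaves out the hierarchy, and is written in Python for brevity. It is an assumption-laden illustration, not the Java implementation mentioned above.

```python
def strip_treemap(sizes, width, height):
    """Lay out positive sizes, in order, as (x, y, w, h) rectangles."""
    if not sizes or any(s <= 0 for s in sizes):
        raise ValueError("sizes must be a non-empty list of positive numbers")
    # Scale sizes so that their total equals the area of the bounding box.
    scale = width * height / sum(sizes)
    areas = [s * scale for s in sizes]

    def average_aspect(strip):
        # Every rectangle in a strip shares the strip's height.
        strip_h = sum(strip) / width
        ratios = [max((a / strip_h) / strip_h, strip_h / (a / strip_h))
                  for a in strip]
        return sum(ratios) / len(ratios)

    strips, current = [], []
    for area in areas:
        if current and average_aspect(current + [area]) > average_aspect(current):
            strips.append(current)   # adding the item made things worse
            current = [area]
        else:
            current.append(area)
    strips.append(current)

    rects, y = [], 0.0
    for strip in strips:
        strip_h = sum(strip) / width
        x = 0.0
        for area in strip:
            w = area / strip_h
            rects.append((x, y, w, strip_h))
            x += w
        y += strip_h
    return rects
```

Because items are never reordered, reading the rectangles left to right and top to bottom gives back the original sequence, which is what makes the layout suitable for time-ordered audio segments.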
March - May 2012: The story so far...
I dedicated much of March and April to background research and to building an understanding of basic audio concepts. I started with the assumption that visualisation of speech would be similar to visualisation of music (the analysis and display of acoustic properties) and so spent my time looking at musical structure visualisation and music information retrieval (MIR). I eventually learned that music and speech information are quite different beasts.
Music and speech audio have different characteristics and purposes. Music visualisation studies patterns, musical properties (such as rhythm and timbre), and for the most part is focused on specific genres such as classical music or electronica. Music visualisation is targeted toward music analysis and music library navigation, while speech audio is used to convey information, commonly for learning or as a memory aid.
Unlike with music, Automatic Speech Recognition (or ASR) is commonly applied to speech to gain a new level of understanding about the audio; what is said is more important than how it is said or what it sounds like. Audio features are targeted toward prosody rather than pattern, and are not directly visualised as in music visualisation. Instead, tools supporting speech visualisation use audio features for segmentation, summarisation, and manipulation. I believe the lack of display of raw audio features in speech visualisation is due in part to the intended audience. Listeners of speech audio are concerned with navigation and retrieval, and less with exploration or analysis than listeners of music.
At the beginning of May, I identified a task taxonomy based on prior research in speech navigation tools. I now consider the tasks below important for any tool that supports the navigation of speech audio.
Speech audio task taxonomy:
- Determining relevance – High-level, determined with audio meta-data (ID3) and speech introduction.
- Building a summary – High-level, determined by identifying introduction, content, and conclusion sections of audio.
- Identifying an important section – Mid-level, determined by speech transcript and acoustic features.
- Identifying when a specific fact is given – Low-level, determined by speech transcript.