Diary of Fahmi Abdulhamid

January 2013: Fleshing out my Thesis

I spent January writing the remainder of my thesis and revising existing content. Of note, I have written the Background and Feature Extraction chapters. Stuart has continued to provide feedback, which I have readily incorporated into my thesis. By the end of January, every chapter of my thesis, excepting the Conclusion, had arrived at what I consider a first draft.

I have also begun setting up my final experiment. I plan to record myself delivering the same lecture extract multiple times, with different words and phrases each time, to measure how stable my segmentation algorithm is under changes in wording and sentence structure. I am currently setting up CMUSphinx 4 (an automatic speech recognition library written in Java) to time-align my speech recordings to a text transcript (I will plan what I will say beforehand). The challenge here is first to record clean and natural speech, and second to configure CMUSphinx 4 to transcribe what I say with reasonable accuracy (ideally > 80% correct).
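
To check the "> 80% correct" target, something like the following should do. It is a minimal sketch (file names are placeholders) that scores a CMUSphinx transcription against the planned script using word-level edit distance:

    # Minimal sketch: word-level accuracy of an ASR hypothesis against a reference
    # transcript. File names below are placeholders.

    def word_accuracy(reference, hypothesis):
        """Return 1 - WER, where WER = word-level edit distance / reference length."""
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        # Dynamic-programming edit distance over words.
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i
        for j in range(len(hyp) + 1):
            dist[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + cost) # substitution or match
        return 1.0 - dist[len(ref)][len(hyp)] / len(ref)

    if __name__ == "__main__":
        reference = open("planned_script.txt").read()   # what I planned to say
        hypothesis = open("sphinx_output.txt").read()   # what CMUSphinx recognised
        print("Word accuracy: {:.1%}".format(word_accuracy(reference, hypothesis)))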

December 2012: Enhancing my Visualisation

As last month, I spent most of my time working on my thesis, but I have also made some enhancements to my visualisation to resolve some of the difficulties participants encountered in my user study.

Context sentence popups, which appear when hovering over Word Cloud words to provide context, now display more content. Clicking the background of the visualisation no longer clears highlighted words. Instead, clicking on the background positions the audio to the point clicked.

BigContextSentencePopup.png
Visualisation with large Context Sentence Popup display.

It seems that clicking on the background is a natural gesture that participants expect to move the audio, so I added a separate button to clear highlighted words instead.

I also added a Transcript Anywhere feature which displays a news-ticker-esque view of the transcript text underneath the mouse. Transcript Anywhere appears on a mouse-hold gesture; holding the mouse and moving left and right scrolls the text accordingly. I like the design because it shows the transcript exactly where the user is gazing without obscuring information in the visualisation.

TranscriptAnywhere.png
Visualisation with Transcript Anywhere displayed at mouse position.

November 2012: The Thesis begins

I spent November primarily writing my thesis. Most notably, I revised my thesis structure somewhat, fleshed out my Results chapter and completed my Design chapter.

I have shown Stuart some of my work, and he has provided some helpful feedback.

October 2012: User Study Analysis

I spent my time analysing the results of my user study. At my disposal are recorded participant actions, participant comments, NASA Task Load Index (NASA-TLX) results, and the notes I took while observing participants complete the study.

I wrote several R and Python scripts to analyse my results. Comparing NASA-TLX results and usage strategies between the highest- and lowest-scoring participants proved quite insightful. I have also begun to write my results into my thesis.
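
The scripts are mostly small comparisons along these lines (a rough sketch with hypothetical column names, not one of the actual scripts): it computes an unweighted "raw" NASA-TLX score per participant and compares workload between the best- and worst-performing participants.

    # Hypothetical CSV layout: one row per participant, six NASA-TLX subscale
    # ratings plus an overall task score.
    import pandas as pd

    SUBSCALES = ["mental", "physical", "temporal", "performance", "effort", "frustration"]

    df = pd.read_csv("tlx_results.csv")
    df["raw_tlx"] = df[SUBSCALES].mean(axis=1)   # raw TLX = mean of the six subscales

    top = df.nlargest(5, "task_score")           # five best-performing participants
    bottom = df.nsmallest(5, "task_score")       # five worst-performing participants
    print("Mean raw TLX, top performers:   ", top["raw_tlx"].mean())
    print("Mean raw TLX, bottom performers:", bottom["raw_tlx"].mean())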

August - September 2012: User testing

In these two months, I polished my visualisation to make it more aesthetically pleasing and started to write my thesis. I mostly developed the structure and wrote the introduction and user study chapters of my thesis.

I later obtained approval from the Human Ethics Committee and booked the usability lab for my user study. In total, I had six participants take part in my pilot study and twenty (different) participants take part in the actual user study. The pilot study resolved most of the issues with the user study, but because a new task that partially gave away the answer was added at the last minute, not all participants completed the corrected task :(.

Anyway, I've now begun to evaluate my results. I aim to differentiate participants who performed well from those who didn't, in order to identify differences in strategy when using the visualisation to find information.

July 2012: Getting ready for testing

I focused on user testing this month. After some research into user testing software, I developed a couple of user test plans and revised them with Stuart. The final test plan will record users as they complete a series of tasks on a small range of audio recordings, with a practice run beforehand. Each task will have the participant extract some information from an audio recording, and each participant will fill out a questionnaire at the end of the test.

I did a lot of work to extend my visualisation to support the tasks I needed for my user study. My visualisation now records user events, is free of just about every bug I know of, and has a web.py server back-end to guide participants through the user study.
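
The back-end is nothing elaborate; in spirit it is close to the following (a simplified sketch rather than the actual study server, with made-up URLs and file names):

    # web.py server sketch: serves the next study page and logs user events
    # POSTed by the visualisation. Endpoint names and files are placeholders.
    import web

    urls = (
        '/', 'Index',
        '/log', 'LogEvent',
    )
    app = web.application(urls, globals())

    class Index:
        def GET(self):
            return "instructions for the current task go here"

    class LogEvent:
        def POST(self):
            event = web.data()                  # raw event payload from the browser
            with open("events.log", "a") as f:
                f.write(str(event) + "\n")
            return "ok"

    if __name__ == "__main__":
        app.run()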

I also looked at TAFE, my feature extraction back-end, to measure how its results are affected by varying the number of segments it produces. I averaged the results from three different audio recordings. It seems that the results are quite stable below 20 segments but change markedly from 20 segments onward. I assume the instability is due to the small segment sizes, which make it difficult to find topics that occur over a long stretch of time. The small segments may instead be identifying short and insignificant topics. I must investigate further to find out what is really going on.
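
One way to put a number on "how much the results change" (my working assumption, not the final analysis) is to compare the segment boundaries TAFE produces at n segments with those at n + 1 segments:

    # Sketch of a stability metric (illustrative only): mean distance, in seconds,
    # from each boundary in one segmentation to its nearest counterpart in another.

    def boundary_drift(boundaries_a, boundaries_b):
        return sum(min(abs(a - b) for b in boundaries_b) for a in boundaries_a) / len(boundaries_a)

    # Made-up boundary times (seconds) from one recording at two segment counts:
    at_n_segments = [31.2, 65.0, 98.7, 140.1]
    at_n_plus_1_segments = [30.8, 64.9, 101.3, 139.5, 150.2]
    print(boundary_drift(at_n_segments, at_n_plus_1_segments))   # small value = stable

Averaging this drift over the three recordings for each segment count would give a single stability curve to inspect.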

June 2012: First steps

June was devoted to building working prototypes of a speech navigation visualisation/interface. Having TAFE up and running allowed me to focus on visualisation, usability, and utility of a speech navigation tool.

I began by building my first visualisation in Java. I utilised an existing Java library to create my Strip Treemap. You can see my Treemap below:

june_javaTreemap.png
First visualisation prototype, written in Java.

Audio segments are represented as rectangles, progressing from left to right and row by row. Rectangle size corresponds to segment duration. In the centre of each rectangle is a significant phrase (a phrase that occurs with high probability in one segment but with low probability in the remaining segments). Clicking on a segment plays that segment.
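
The layout itself is straightforward. The sketch below (Python, simplified; not the Java library's actual algorithm) captures the idea: rectangles keep their order, each area is proportional to segment duration, and a new row starts when adding another segment would worsen the row's average aspect ratio.

    # Simplified order-preserving strip treemap layout. Durations are mapped to
    # areas; rows ("strips") are filled left to right until aspect ratios suffer.

    def strip_treemap(durations, width, height):
        """Return one (x, y, w, h) rectangle per duration, preserving order."""
        total = float(sum(durations))
        areas = [d / total * width * height for d in durations]

        def avg_aspect(strip):
            strip_h = sum(strip) / width              # strip height from its total area
            aspects = []
            for a in strip:
                w = a / strip_h                       # item width within the strip
                aspects.append(max(w / strip_h, strip_h / w))
            return sum(aspects) / len(aspects)

        strips, current = [], []
        for a in areas:
            if current and avg_aspect(current + [a]) > avg_aspect(current):
                strips.append(current)                # adding would worsen the strip
                current = [a]
            else:
                current.append(a)
        strips.append(current)

        rects, y = [], 0.0
        for strip in strips:
            strip_h = sum(strip) / width
            x = 0.0
            for a in strip:
                w = a / strip_h
                rects.append((x, y, w, strip_h))
                x += w
            y += strip_h
        return rects

    # Example: six segments with durations in seconds, laid out in an 800x400 box.
    print(strip_treemap([120, 45, 300, 80, 600, 220], width=800, height=400))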

Initial feedback from the HCI group was positive with regard to the Strip Treemap visualisation, as it made segments clear. The significant phrases were not so well received: they were unintelligible because the words were stemmed, and similar segments were not evident because phrases were constrained to be dissimilar.

I must say, I do agree.

With these improvements in mind, I rebuilt the Audio Strip Treemap, this time in JavaScript/SVG/HTML5/CSS3 to give me more graphical power and freedom and to allow my tool to be easily shared. I used D3 for graphics, which meant I had to port the Strip Treemap layout algorithm to JavaScript. Below, I present the latest version of my audio navigation interface:

june_javascriptTreemap.png
Second visualisation prototype, written in JavaScript.

Features of note include:

  • Word clouds display frequent words, where word size and opacity correspond to frequency (a rough sketch of this mapping follows the list). Unlike significant phrases, frequent words display a greater breadth of detail about a segment and do not have to make sense when read as a sentence. Word clouds also promote skimming for efficiency.
  • Hovering over a word in a word cloud displays the word in a sentence to provide context to the word. Context is not visible in a standard word cloud.
  • Clicking on a word highlights where the stemmed word appears in the transcript. The word is given a colour so multiple words can be selected and displayed at the same time. Highlighting words in the transcript gives the user a finer level of detail about the subject(s) he or she is interested in.
  • Audio controls at the bottom allow for a finer level of movement in the audio compared to the previous interface. The audio controls display segment and row boundaries, which help to reliably compare segment durations and give cues to the user about a shift in visual context. Parts the user has listened to are highlighted to help with memory and navigation, which is useful in a random-access audio interface.
  • Controls to the right (not visible in the screenshot) enable highlighting of segments based on loudness and pitch if desired.
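
The word cloud mapping mentioned above boils down to something like this (an illustrative Python sketch; the real visualisation does this in JavaScript/D3, and the stop-word list here is trimmed down):

    # Map the most frequent words of a segment to font sizes and opacities.
    from collections import Counter

    STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it", "that"}

    def word_cloud(segment_text, min_size=10, max_size=40):
        words = [w for w in segment_text.lower().split() if w not in STOPWORDS]
        top = Counter(words).most_common(15)
        peak = top[0][1]                                   # count of the most frequent word
        cloud = []
        for word, count in top:
            scale = count / peak                           # 1.0 for the most frequent word
            cloud.append({
                "word": word,
                "size": min_size + scale * (max_size - min_size),   # font size (px)
                "opacity": 0.4 + 0.6 * scale,              # keep rarer words legible
            })
        return cloud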

I plan to get ready for user testing next month.

May 2012: It begins

Before this month I had developed several prototypes and experiments, but nothing on the scale of what I have done this month. With my task taxonomy in mind, I focused on which features should be used to help users navigate speech audio. I settled on six features, chosen because they should be simple for users to understand while still being expressive enough to derive higher-level features.

Features extracted for speech audio visualisation:

  • Audio features:
    • Loudness (RMS energy)
    • Pitch (Fundamental Frequency or f0)
  • Text features:
    • Time-stamped transcript (SubRip format)
    • Speech rate (words per second)
    • Sentence duration (in seconds)
    • Segments (automatically identified by system)

The features were experimentally extracted using various systems: Marsyas (C++) for loudness and pitch, CMU Sphinx (Java) for the transcript, Matlab for segment identification, and LingPipe (Java) for segment descriptors. I developed my Audio Lights prototype to integrate all of these features into a simple tool and determine their feasibility for visualisation.
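
To give a flavour of what the simpler features look like in code, here is an illustrative Python sketch (not the actual Marsyas or TAFE implementation; it assumes a mono 16-bit WAV and an already-parsed time-stamped transcript):

    # Two of the features above: RMS loudness over short windows, and speech rate.
    import math
    import struct
    import wave

    def rms_loudness(path, window_seconds=0.5):
        """One RMS value per window of a mono 16-bit WAV file."""
        with wave.open(path) as wav:
            rate = wav.getframerate()
            n = wav.getnframes()
            samples = struct.unpack("<%dh" % n, wav.readframes(n))
        step = int(rate * window_seconds)
        return [
            math.sqrt(sum(s * s for s in samples[i:i + step]) / step)
            for i in range(0, len(samples) - step + 1, step)
        ]

    def speech_rate(timed_words):
        """Words per second from (word, start_s, end_s) tuples, e.g. parsed from SubRip."""
        duration = timed_words[-1][2] - timed_words[0][1]
        return len(timed_words) / duration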

Happy with the results, I developed several concept sketches for possible user interfaces, bearing in mind my task taxonomy and how the above features could be mapped to visual elements. Rather than designing a new visualisation from scratch, my supervisor and I decided to augment a classical visualisation first, as creating a new visualisation is difficult.

I decided to implement a Strip Treemap. A Strip Treemap is a hierarchical Treemap layout algorithm that preserves the order of the data. But to implement this, I spent a week developing a feature extraction system that takes an audio file and its accompanying transcript as input and produces all the features listed above as output. I paid attention to good system design, and ported my Marsyas and Matlab code to Java. Feature extraction algorithms were unit tested to help ensure correctness. I called the system "Transcript and Audio Feature Extractor", or "TAFE" for short.

With TAFE complete, I implemented a simple Strip Treemap with the extracted features! I will post my results after more experimentation.

March - May 2012: The story so far...

I dedicated much of March and April to background research and to building an understanding of basic audio concepts. I started with the assumption that visualisation of speech would be similar to visualisation of music (the analysis and display of acoustic properties) and so spent my time looking at musical structure visualisation and music information retrieval (MIR). I eventually learned that music and speech information are quite different beasts.

Music and speech audio have different characteristics and purposes. Music visualisation studies patterns, musical properties (such as rhythm and timbre), and for the most part is focused on specific genres such as classical music or electronica. Music visualisation is targeted toward music analysis and music library navigation, while speech audio is used to convey information, commonly for learning or as a memory aid.

Unlike with music, automatic speech recognition (ASR) is commonly used to gain a new level of understanding about the audio; what is said is more important than how it is said or what it sounds like. Audio features are targeted toward prosody rather than pattern, and are not directly visualised as in music visualisation. Instead, tools supporting speech visualisation use audio features for segmentation, summarisation, and manipulation. I believe the lack of raw audio feature displays in speech visualisation lies, in part, with the intended audience: listeners of speech audio are concerned with navigation and retrieval, not so much with exploration or analysis as with music.

At the beginning of May, I identified a task taxonomy based on prior research in speech navigation tools. I now consider the tasks below to be important for any tool that supports the navigation of speech audio.

Speech audio task taxonomy:

  • Determining relevance – High-level, determined with audio meta-data (ID3) and speech introduction.
  • Building a summary – High-level, determined by identifying introduction, content, and conclusion sections of audio.
  • Identifying an important section – Mid-level, determined by speech transcript and acoustic features.
  • Identifying when a specific fact is given – Low-level, determined by speech transcript.
