Rao Ma

I am currently a second-year PhD student in Information Engineering in the Speech Group at the University of Cambridge, supervised by Dr. Kate Knill. I have been working in Automatic Speech Recognition (ASR) and Natural Language Processing (NLP) for more than five years. As first author, I have published research papers at top-tier AI conferences including ICASSP, Interspeech, and NAACL.

Before starting my PhD, I worked as a speech recognition engineer at ByteDance AI Lab, where I enhanced speech recognition technology for products such as TikTok and CapCut, delivering significant improvements in accuracy and reliability. I earned my B.S. degree from Nanjing University in 2018, graduating as the top-ranked student among 150 peers, and completed my M.E. degree at Shanghai Jiao Tong University in 2021.

Research Interests

My current research focuses on speech and language processing, particularly the opportunities and challenges presented by large-scale foundation speech models, including:

  • Understanding: zero-shot prompting for speech-based models, contextual speech embeddings, cross-lingual understanding
  • Adaptation: soft prompt tuning, parameter-efficient fine-tuning (PEFT), elastic weight consolidation (EWC)
  • Downstream tasks: audio classification, speech translation, speech disfluency detection and removal, spoken grammatical error correction (GEC)
  • Enhancement: ASR error correction, extension for low-resource languages

If you are working on similar research topics and are interested in potential collaboration, don’t hesitate to reach out via email at rm2114[at]cam[dot]ac[dot]uk.

News

  • [March 15, 2024] Our paper investigating the emergent audio classification ability of Whisper has been accepted to NAACL 2024. I would like to thank my collaborator, Adian Liusie, for his equal and invaluable contribution to the paper. Looking forward to my trip to Mexico this summer 😄🌮🇲🇽 !
  • [January 25, 2024] The first paper of my PhD reached 10 citations today 😁 The work builds a robust ASR error correction system that utilizes ASR N-best lists and word lattices. Thanks to all my collaborators!
  • [January 23, 2024] Today, I presented our latest work to the AI/ML team at NVIDIA. Big thanks to everyone who joined!

Publications

Google Scholar

†: Equal contribution

Peer-Reviewed Conference Papers

Rao Ma†, Adian Liusie†, Mark Gales, and Kate Knill, "Investigating the Emergent Audio Classification Ability of ASR Foundation Models". In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics. [arXiv] [github] [slides]

Stefano Bannò, Rao Ma, Mengjie Qian, Kate Knill, and Mark Gales, "Towards End-to-End Spoken Grammatical Error Correction". In Proceedings of the 2024 IEEE International Conference on Acoustics, Speech and Signal Processing. [arXiv] [slides]

Mengjie Qian, Rao Ma, Adian Liusie, Erfan Loweimi, Kate Knill, and Mark Gales, "Zero-shot Audio Topic Reranking using Large Language Models". [arXiv]

Rao Ma, Mengjie Qian, Potsawee Manakul, Mark Gales, and Kate Knill, "Can Generative Large Language Models Perform ASR Error Correction?". [arXiv]

Rao Ma, Mengjie Qian, Mark Gales, and Kate Knill, "Adapting an ASR Foundation Model for Spoken Language Assessment". In the 9th Workshop on Speech and Language Technology in Education. [arXiv] [poster] [slides]

Rao Ma†, Mengjie Qian†, Mark Gales, and Kate Knill, "Adapting an Unadaptable ASR System". In INTERSPEECH 2023. [arXiv] [poster]

Rao Ma, Mark Gales, Kate Knill, and Mengjie Qian, "N-best T5: Robust ASR Error Correction using Multiple Input Hypotheses and Constrained Decoding Space". In INTERSPEECH 2023. [arXiv] [poster]

Rao Ma, Xiaobo Wu, Jin Qiu, Yanan Qin, Haihua Xu, Peihao Wu, and Zejun Ma, "Internal Language Model Estimation based Adaptive Language Model Fusion for Domain Adaptation". In Proceedings of the 2023 IEEE International Conference on Acoustics, Speech and Signal Processing. [arXiv] [poster]

Yufei Liu, Rao Ma, Haihua Xu, Yi He, Zejun Ma, and Weibin Zhang, "Internal Language Model Estimation Through Explicit Context Vector Learning for Attention-based Encoder-decoder ASR". In INTERSPEECH 2022. [arXiv]

Tian Tan, Yizhou Lu, Rao Ma, Sen Zhu, Jiaqi Guo, and Yanmin Qian, "AISpeech-SJTU ASR System for the Accented English Speech Recognition Challenge". In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing.

Houjun Huang, Xu Xiang, Yexin Yang, Rao Ma, and Yanmin Qian, "AISpeech-SJTU Accent Identification System for the Accented English Speech Recognition Challenge". In Proceedings of the 2021 IEEE International Conference on Acoustics, Speech and Signal Processing. [arXiv]

Rao Ma, Hao Li, Qi Liu, Lu Chen, and Kai Yu, "Neural Lattice Search for Speech Recognition". In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. [pdf]

Rao Ma, Lesheng Jin, Qi Liu, Lu Chen, and Kai Yu, "Addressing the Polysemy Problem in Language Modeling with Attentional Multi-Sense Embeddings". In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing. [pdf]

Zihan Zhao, Yuncong Liu, Lu Chen, Qi Liu, Rao Ma, and Kai Yu, "An Investigation on Different Underlying Quantization Schemes for Pre-trained Language Models". In CCF International Conference on Natural Language Processing and Chinese Computing 2020. [arXiv]

Ruisheng Cao, Su Zhu, Chenyu Yang, Chen Liu, Rao Ma, Yanbin Zhao, Lu Chen, and Kai Yu, "Unsupervised Dual Paraphrasing for Two-stage Semantic Parsing". In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. [arXiv]

Rao Ma, Qi Liu, and Kai Yu, "Highly Efficient Neural Network Language Model Compression Using Soft Binarization Training". In the 2019 IEEE Automatic Speech Recognition and Understanding Workshop. [pdf]

Journal Articles

Qi Liu, Rao Ma, and Kai Yu, "Markov Decision Process and Prior Control Vector for Weak Condition Natural Language Generation". In Chinese Journal of Computers (2022). [link]

Kai Yu, Rao Ma, Kaiyu Shi, and Qi Liu, "Neural Network Language Model Compression With Product Quantization and Soft Binarization". In IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020). [pdf]

Su Zhu, Zijian Zhao, Rao Ma, and Kai Yu, "Prior Knowledge Driven Label Embedding for Slot Filling in Natural Language Understanding". In IEEE/ACM Transactions on Audio, Speech, and Language Processing 28 (2020). [arXiv]

Contact