Dr Yoshi Gotoh

PhD

School of Computer Science

Lecturer

Student Projects Officer

Member of the Speech and Hearing (SpandH) research group

y.gotoh@sheffield.ac.uk

Full contact details

Dr Yoshi Gotoh
School of Computer Science
Regent Court (DCS)
211 Portobello
Sheffield
S1 4DP
Profile

Yoshi is a lecturer in the Department of Computer Science. He has a first degree in Engineering from the University of Tokyo and a PhD from Brown University.

Research interests

Yoshi has been working in the field of speech and spoken language processing for years. His current interests include audio-visual processing, in particular video analysis and video information retrieval.

Publications

Journal articles

  • Al Ghamdi M & Gotoh Y (2020) . Machine Vision and Applications, 31.
  • Khan MUG & Gotoh Y (2017) . Machine Vision and Applications, 28(3-4), 243-265.
  • Khan MUG, Nasir A, Riaz O, Gotoh Y & Amiruddin M (2016) A statistical model for annotating videos with human actions. Pakistan Journal of Statistics, 32(2), 109-123.
  • Khan M, AlHarbi N & Gotoh Y (2015) . Information Sciences, 303, 61-82.
  • Al Harbi N & Gotoh Y (2015) . Neurocomputing, 161, 56-64.
  • Zhang L, Gotoh Y & Khan M (2012) . International Journal of Advanced Robotic Systems, 9.
  • Kolluru B & Gotoh Y (2009) . Natural Language Engineering, 15, 193-213.
  • Punitha P, Misra H, Ren R, Hannah D, Goyal A, Villa R & Jose JM (2009) Glasgow University at TRECVID 2009. 2009 Trec Video Retrieval Evaluation Notebook Papers.
  • Christensen H, Gotoh Y & Renals S (2008) . IEEE Transactions on Audio, Speech, and Language Processing, 16(1), 151-161.
  • Gotoh Y & Renals S (2000) Information extraction from broadcast news. Philosophical Transactions of the Royal Society A, 358(1769), 1295-1309.
  • Gotoh Y & Renals S (1999) Topic-based mixture language modelling. Natural Language Engineering, 5(4), 355-375.
  • Gotoh Y, Hochberg MM & Silverman HF (1998) Efficient training algorithms for HMM's using incremental estimation. IEEE Transactions on Speech and Audio Processing, 6(6), 539-548.
  • Charniak E, Carroll G, Adcock J, Cassandra A, Gotoh Y, Katz J, Littman M & McCann J (1996) . Artificial Intelligence, 85(1-2), 45-57.
  • Charniak E, Carroll G, Adcock J, Cassandra A, Gotoh Y, Katz J, Littman M & McCann J (1996) . Artificial Intelligence, 84(1-2), 357-357.
  • Mashao D, Gotoh Y & Silverman HF (1996) . IEEE Signal Processing Letters, 3(4), 103-106.

Conference proceedings

  • Clarke J, Gotoh Y & Goetze S (2025) . Speech and Computer: 27th International Conference, SPECOM 2025, Szeged, Hungary, October 13-15, 2025, Proceedings, Part II, Vol. LNAI 16188 (pp 289-301).
  • Clarke J, Gotoh Y & Goetze S (2024) . 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). Taipei, Taiwan, 16 December 2023.
  • Alrashidi A, Cudd P, Abhayaratne C & Gotoh Y (2023) . CHI '23: Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (pp 110). Hamburg, Germany, 23 April 2023.
  • Alvi M, Khan MUG, Gotoh Y, Sadiq M & Aslam M (2020) University of Engineering & Technology, Lahore the University of Sheffield at TRECVID 2015: Instance search. 2015 TREC Video Retrieval Evaluation, TRECVID 2015
  • Amanat S, Khan MUG, Nida N & Gotoh Y (2020) The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2014: Instance search task. 2014 TREC Video Retrieval Evaluation, TRECVID 2014
  • Algadhy R, Gotoh Y & Maddock S (2019) . ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp 2367-2371). Brighton, UK, 12 May 2019.
  • Al Ghamdi M & Gotoh Y (2019) Graph-based correlated topic model for motion patterns analysis in crowded scenes from tracklets. British Machine Vision Conference 2018, BMVC 2018
  • Al Ghamdi M & Gotoh Y (2018) . 2018 IEEE Winter Conference on Applications of Computer Vision (pp 1029-1037). Lake Tahoe, NV/CA, 12 March 2018.
  • Al Ghamdi M & Gotoh Y (2018) Graph-based correlated topic model for motion patterns analysis in crowded scenes from tracklets. British Machine Vision Conference 2018, BMVC 2018
  • Khan MUG, Gotoh Y & Nida N (2017) . Medical Image Understanding and Analysis, Vol. 723 (pp 571-580)
  • Al Harbi N & Gotoh Y (2017) Natural language descriptions for human activities in video streams. Proceedings of the 10th International Conference on Natural Language Generation (pp 85-94). Santiago de Compostela, Spain, 4 September 2017.
  • Al Harbi N & Gotoh Y (2016) Natural language descriptions of human activities scenes: Corpus generation and analysis. Proceedings of the Annual Meeting of the Association for Computational Linguistics (pp 39-47)
  • Algadhy R, Gotoh Y & Maddock S (2016) Analysis of visemes in the GRID corpus. Abstract of UKspeech
  • Masrani A & Gotoh Y (2016) Overlapped interest and the impact of visual and audio information in the human perception. Abstract of UKspeech
  • Wahla SQ, Waqar S, Ghani Khan MU & Gotoh Y (2016) The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2016: Video to text description task. 2016 TREC Video Retrieval Evaluation, TRECVID 2016
  • Masrani A & Gotoh Y (2015) Corpus generation and analysis: incorporating audio data towards curbing missing information. Proceedings of KDWEB
  • Al Harbi N & Gotoh Y (2015) Describing spatio-temporal relations between object volumes in video streams. AAAI Workshop Technical Report, Vol. WS-15-14 (pp 2-8)
  • Alvi M, Khan MUG, Gotoh Y, Sadiq M & Aslam M (2015) University of Engineering & Technology, Lahore the University of Sheffield at TRECVID 2015: Instance search. 2015 TREC Video Retrieval Evaluation, TRECVID 2015
  • Al Ghamdi M & Gotoh Y (2014) . ICISP. Cherbourg, 30 June 2014.
  • Al Ghamdi M & Gotoh Y (2014) Alignment of nearly-repetitive contents in a video stream with manifold embedding. ICASSP. Firenze
  • Al Ghamdi M & Gotoh Y (2014) Video clip retrieval by graph matching. ECIR. Amsterdam
  • Amanat S, Khan MUG, Nida N & Gotoh Y (2014) The University of Sheffield and University of Engineering & Technology, Lahore at TRECVID 2014: Instance search task. 2014 TREC Video Retrieval Evaluation, TRECVID 2014
  • Al Harbi N & Gotoh Y (2013) Action recognition: spatio-temporal human body region tracking approach. CAIP - REACTS workshop. York
  • Al Ghamdi M & Gotoh Y (2013) Spatio-temporal manifold embedding for nearly-repetitive contents in a video stream. CAIP. York
  • Al Harbi N & Gotoh Y (2013) Spatio-temporal human body segmentation from video stream. CAIP. York
  • Khan MUG, Bashir K, Shah AA, Zhang L, Gotoh Y, Khan PI & Amiruddin M (2013) The University of Sheffield, Harbin Engineering University and University of Engineering & Technology, Lahore at TRECVID 2013: Instance search & semantic indexing. 2013 TREC Video Retrieval Evaluation, TRECVID 2013
  • Khan M, Bashir K, Shah A, Zhang L, Gotoh Y, Khan P & Amiruddin M (2013) The University of Sheffield, Harbin Engineering University and University of Engineering & Technology, Lahore at TRECVID 2013: Instance Search & Semantic indexing. TRECVID
  • Al Ghamdi M, Khan M, Zhang L & Gotoh Y (2012) The University of Sheffield and Harbin Engineering University at TRECVID 2012: Instance Search. TRECVID
  • Khan M, Zhang L & Gotoh Y (2011) Human focused video description. ICCV - VECTaR workshop. Barcelona
  • Zhang L, Khan M & Gotoh Y (2011) Video scene classification based on natural language description. ICCV - ARTEMIS workshop. Barcelona
  • Khan M, Zhang L & Gotoh Y (2011) Towards coherent natural language description of video streams. ICCV - SIG workshop. Barcelona
  • Chantamunee S & Gotoh Y (2010) Nearly-repetitive video synchronisation using nonlinear manifold embedding. ICASSP. Dallas
  • Chantamunee S & Gotoh Y (2008) University of Sheffield at TRECVID 2008: Rushes Summarisation and Video Copy Detection. TRECVID
  • Chantamunee S & Gotoh Y (2008) Shot alignment in pre-production video. MLMI. Utrecht
  • Chantamunee S & Gotoh Y (2007) University of Sheffield at TRECVID 2007: Shot Boundary Detection and Rushes Summarisation. TRECVID
  • Kolluru B & Gotoh Y (2007) Speaker Role Based Structural Classification of Broadcast News Stories. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4 (pp 141-144)
  • Kolluru B & Gotoh Y (2007) Relative Evaluation of Informativeness in Machine Generated Summaries. Interspeech 2007: 8th Annual Conference of the International Speech Communication Association, Vols 1-4 (pp 145-148)
  • Kolluru B, Christensen H & Gotoh Y (2005) Multi-stage compaction approach to broadcast news summarisation. Interspeech. Lisbon
  • Kolluru B & Gotoh Y (2005) On the subjectivity of human authored short summaries. ACL Workshop: Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarizati. Ann Arbor
  • Christensen H, Kolluru BK, Gotoh Y & Renals S (2005) Maximum entropy segmentation of broadcast news. 2005 IEEE International Conference on Acoustics, Speech, and Signal Processing, Vols 1-5 (pp 1029-1032)
  • Kolluru B, Christensen H & Gotoh Y (2004) Decremental feature-based compaction. DUC Workshop. Boston
  • Christensen H, Kolluru BK, Gotoh Y & Renals S (2004) From text summarisation to style-specific summarisation for broadcast news. Advances in Information Retrieval, Proceedings, Vol. 2997 (pp 223-237)
  • Christensen H, Gotoh Y, Kolluru B & Renals S (2003) Are extractive text summarisation techniques portable to broadcast news? ASRU'03: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding ASRU '03 (pp 489-494)
  • Kolluru B, Christensen H, Gotoh Y & Renals S (2003) Exploring the style-technique interaction in extractive summarization of broadcast news. ASRU'03: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding ASRU '03 (pp 495-500)
  • Gotoh Y & Renals S (2003) Statistical language modelling. Text- and Speech-Triggered Information Access, Vol. 2705 (pp 78-105)
  • Christensen H, Gotoh Y & Renals S (2001) Punctuation Annotation Using Statistical Prosody Models. Proceedings of the ISCA Workshop on Prosody in Speech Recognition and Understanding (pp 35-40)
  • Gotoh Y & Renals S (2000) Sentence boundary detection in broadcast speech transcripts. ISCA ASR Workshop. Paris
  • Gotoh Y & Renals S (2000) Variable word rate n-grams. 2000 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings, Vols I-VI (pp 1591-1594)
  • Renals S & Gotoh Y (1999) Integrated transcription and identification of named entities in broadcast speech. Eurospeech. Budapest
  • Gotoh Y & Renals S (1999) Statistical annotation of named entities in spoken audio. ESCA Workshop: Accessing Information in Spoken Audio. Cambridge
  • Gotoh Y, Renals S & Williams G (1999) Named entity tagged language models. ICASSP '99: 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings Vols I-VI (pp 513-516)
  • Gotoh Y & Renals S (1997) Document space models using latent semantic analysis. Eurospeech. Rhodes
  • Adcock J, Gotoh Y, Mashao D & Silverman HF (1996) Microphone-array speech recognition via incremental MAP training. ICASSP. Atlanta
  • Gotoh Y & Silverman HF (1996) Incremental ML estimation of HMM parameters for efficient training. ICASSP. Atlanta
  • Gotoh Y, Hochberg MM, Mashao D & Silverman HF (1995) Incremental MAP estimation of HMMs for efficient training and improved performance. ICASSP. Detroit
  • Gotoh Y, Hochberg MM & Silverman HF (1994) Using MAP estimated parameters to improve HMM speech recognition performance. ICASSP. Adelaide
  • Clarke J, Gotoh Y & Goetze S () Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings. Proceedings of the European Signal Processing Conference
  • Clarke J, Gotoh Y & Goetze S () Speaker Embedding Informed Audiovisual Active Speaker Detection for Egocentric Recordings. Proceedings of the ... IEEE International Conference on Acoustics, Speech, and Signal Processing / sponsored by the Institute of Electrical and Electronics Engineers Signal Processing Society. ICASSP (Conference)
  • Khan M, Al Harbi N & Gotoh Y (2012) Natural language descriptions for video streams. V&L Net Workshop. Sheffield, December 2012.
  • Al Ghamdi M, Zhang L & Gotoh Y (2012) Spatio-temporal SIFT and its application to human action classification. ECCV - VECTaR workshop. Firenze, October 2012.
  • Al Ghamdi M, Al Harbi N & Gotoh Y (2012) Spatio-temporal video representation with locality-constrained linear coding. ECCV - ARTEMIS workshop. Firenze, October 2012.
  • Khan M, Zhang L & Gotoh Y (2012) Generating coherent natural language annotations for video streams. ICIP. Orlando, September 2012.
  • Khan M & Gotoh Y (2012) Natural language descriptions of visual scenes: corpus generation and analysis. EACL workshop. Avignon, April 2012.
  • Khan M & Gotoh Y (2012) Describing video contents in natural language. EACL workshop. Avignon, April 2012.
  • Kolluru B & Gotoh Y (2007) . Interspeech 2007 (pp 2593-2596)
  • Kolluru B & Gotoh Y (2007) . Interspeech 2007 (pp 1338-1341)

Working papers

  • Urban J, Hilaire X, Hopfgartner F, Villa R, Jose JM, Chantamunee S & Gotoh Y (2006) Glasgow University at TRECVID 2006. TRECVID 2006 - Text REtrieval Conference TRECVid Workshop, 363-367.
Grants
  • Visual Understanding for Fake Imagery Detect, Innovate UK, 09/2021 - 03/2024, £218,226, as Co-PI
  • Multimedia Analysis for Unsupervised Dubbing In Entertainment (MAUDIE), Innovate UK, 04/2018 - 03/2021, £393,115, as Co-PI
  • S3L: Statistical Summarization of Spoken Language, EPSRC, 12/2001 - 09/2005, £284,248, as Co-PI
Professional activities and memberships

Member of the Speech and Hearing (SpandH) research group