A Study of a Streaming End-to-End Speech Recognition Model Using a Hybrid RNN-T/Attention Architecture and Internal Language Model Integration

森谷崇史; 芦原孝典; 安藤厚志; 佐藤宏; 田中智大; 松浦孝平; 増村亮; DELCROIX Marc; 篠崎隆宏

J-GLOBAL ID: 202202221597615112  Reference number: 22A1066811

A Study on Hybrid RNN-T/Attention-based Streaming ASR with Triggered Chunkwise Attention and Dual Internal Language Model Integration

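Volume: 121  Issue: 383 (EA2021 64-97)  Pages: 90-95 (web only)  Publication date: February 22, 2022
JST material number: U2030A  ISSN: 2432-6380  Document type: Proceedings (C)
Article type: Original paper  Country of publication: Japan (JPN)  Language: Japanese (JA)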

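This paper describes improvements to the Hybrid RNN-T/Attention model, which combines a recurrent neural network transducer (RNN-T) with an attention-based decoder (AD), for streaming speech recognition. An AD generally needs the input speech from its beginning to its end in order to compute attention weights, which makes streaming operation difficult. In prior work, we therefore proposed a Hybrid RNN-T/Attention model that can operate in streaming mode by combining the RNN-T with a triggered attention-based decoder (TAD), which computes attention weights from the acoustic features between the start of the utterance and each trigger position. Although the conventional TAD made streaming processing possible, its computational cost and memory consumption remained problems. In this work, we propose a Hybrid RNN-T/Attention model that uses a triggered chunkwise attention-based decoder (TCAD), which can reduce the computational cost while maintaining recognition accuracy. To improve recognition accuracy further, we also investigate a language model integration method that uses the two kinds of internal language model contained in the Hybrid RNN-T/Attention model. (Author abstract)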
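
The difference between the two decoders can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how TCAD forms its chunks, so the sketch assumes a fixed-size window of frames that ends at the trigger. The function names, `chunk_size`, and the toy scores are all hypothetical.

```python
import math

def triggered_window(trigger, num_frames):
    """Frames visible to a triggered decoder: utterance start up to the trigger (inclusive)."""
    end = min(trigger, num_frames - 1) + 1
    return range(0, end)

def chunkwise_window(trigger, num_frames, chunk_size):
    """ASSUMPTION: the chunkwise variant sees only the last chunk_size frames ending at the trigger."""
    end = min(trigger, num_frames - 1) + 1
    return range(max(0, end - chunk_size), end)

def attend(scores, window):
    """Softmax over the scores inside a non-empty window; frames outside it get weight 0."""
    peak = max(scores[t] for t in window)
    exps = {t: math.exp(scores[t] - peak) for t in window}
    total = sum(exps.values())
    return [exps[t] / total if t in exps else 0.0 for t in range(len(scores))]

# Toy per-frame attention scores for one output token (hypothetical values).
scores = [0.1, 0.4, 0.2, 0.9, 0.3, 0.7, 0.5, 0.0]
trigger = 5  # frame index at which this token is triggered

tad_weights = attend(scores, triggered_window(trigger, len(scores)))
tcad_weights = attend(scores, chunkwise_window(trigger, len(scores), chunk_size=3))
```

Under this assumption, the number of frames attended to per token grows with the trigger position for the triggered window but stays bounded by `chunk_size` for the chunkwise window, which is consistent with the reduction in computational cost that the abstract attributes to TCAD.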

Pattern recognition, Artificial intelligence
