K. Drossos, S. Adavanne, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2020, pp. 736-740.
A.-M. Oncescu, A. S. Koepke, J. Henriques, Z. Akata, and S. Albanie, “Audio retrieval with natural language queries,” in Proc. Interspeech, 2021.
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” arXiv preprint arXiv:2103.00020, 2021.
S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. Wilson, “CNN architectures for large-scale audio classification,” in Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP), 2017.
V. Sanh, L. Debut, J. Chaumond, and T. Wolf, “DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,” arXiv preprint arXiv:1910.01108, 2019.