Generative Spoken Dialogue Language Modeling

Tu Anh Nguyen; Eugene Kharitonov; Jade Copet; Yossi Adi; Wei-Ning Hsu; Ali Elkahky; Paden Tomasello; Robin Algayres; Benoît Sagot; Abdelrahman Mohamed; Emmanuel Dupoux

doi:10.1162/tacl_a_00545

Abstract

We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.
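
As a rough illustration of the "dual-tower transformer architecture with cross-attention" mentioned in the abstract, the sketch below runs one simplified layer over two channels of unit embeddings. It is not the authors' implementation: the function names, the toy vectors, the absence of learned projections, the residual additions, the causal masking of cross-attention, and the reuse of one function for both channels are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Scaled dot-product attention in which step t sees only steps 0..t."""
    dim = len(queries[0])
    outputs = []
    for t, query in enumerate(queries):
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        outputs.append(
            [sum(w * values[s][d] for s, w in enumerate(weights)) for d in range(dim)]
        )
    return outputs


def add(xs, ys):
    """Element-wise residual addition of two sequences of vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]


def tower_layer(own, other):
    """Hypothetical layer: self-attention on own channel, then cross-attention
    to the other channel (no learned weights; an assumption for brevity)."""
    hidden = add(own, causal_attention(own, own, own))
    return add(hidden, causal_attention(hidden, other, other))


def dual_tower_layer(channel_a, channel_b):
    """Apply the same layer to both channels, each attending to the other."""
    return tower_layer(channel_a, channel_b), tower_layer(channel_b, channel_a)


# Toy input: three time steps of 2-dimensional "unit embeddings" per channel.
channel_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
channel_b = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
out_a, out_b = dual_tower_layer(channel_a, channel_b)
```

Each output channel keeps the length of its input, and under the causal-masking assumption the output at a given step depends only on frames of both channels up to that step.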

Anthology ID: 2023.tacl-1.15
Volume: Transactions of the Association for Computational Linguistics, Volume 11
Year: 2023
Address: Cambridge, MA
Venue: TACL
Publisher: MIT Press
Pages: 250–266
URL: https://aclanthology.org/2023.tacl-1.15/
DOI: 10.1162/tacl_a_00545
Bibkey: nguyen-etal-2023-generative
Cite (ACL): Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed, and Emmanuel Dupoux. 2023. Generative Spoken Dialogue Language Modeling. Transactions of the Association for Computational Linguistics, 11:250–266.
Cite (Informal): Generative Spoken Dialogue Language Modeling (Nguyen et al., TACL 2023)
PDF: https://aclanthology.org/2023.tacl-1.15.pdf
Video: https://aclanthology.org/2023.tacl-1.15.mp4


Export citation

BibTeX

@article{nguyen-etal-2023-generative,
    title = "Generative Spoken Dialogue Language Modeling",
    author = "Nguyen, Tu Anh  and
      Kharitonov, Eugene  and
      Copet, Jade  and
      Adi, Yossi  and
      Hsu, Wei-Ning  and
      Elkahky, Ali  and
      Tomasello, Paden  and
      Algayres, Robin  and
      Sagot, Beno{\^i}t  and
      Mohamed, Abdelrahman  and
      Dupoux, Emmanuel",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "11",
    year = "2023",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2023.tacl-1.15/",
    doi = "10.1162/tacl_a_00545",
    pages = "250--266",
    abstract = "We introduce dGSLM, the first {\textquotedblleft}textless{\textquotedblright} model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model."
}

MODS XML

<?xml version="1.0" encoding="UTF-8"?>
<modsCollection xmlns="http://www.loc.gov/mods/v3">
<mods ID="nguyen-etal-2023-generative">
    <titleInfo>
        <title>Generative Spoken Dialogue Language Modeling</title>
    </titleInfo>
    <name type="personal">
        <namePart type="given">Tu</namePart>
        <namePart type="given">Anh</namePart>
        <namePart type="family">Nguyen</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Eugene</namePart>
        <namePart type="family">Kharitonov</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Jade</namePart>
        <namePart type="family">Copet</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Yossi</namePart>
        <namePart type="family">Adi</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Wei-Ning</namePart>
        <namePart type="family">Hsu</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Ali</namePart>
        <namePart type="family">Elkahky</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Paden</namePart>
        <namePart type="family">Tomasello</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Robin</namePart>
        <namePart type="family">Algayres</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Benoît</namePart>
        <namePart type="family">Sagot</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Abdelrahman</namePart>
        <namePart type="family">Mohamed</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <name type="personal">
        <namePart type="given">Emmanuel</namePart>
        <namePart type="family">Dupoux</namePart>
        <role>
            <roleTerm authority="marcrelator" type="text">author</roleTerm>
        </role>
    </name>
    <originInfo>
        <dateIssued>2023</dateIssued>
    </originInfo>
    <typeOfResource>text</typeOfResource>
    <genre authority="bibutilsgt">journal article</genre>
    <relatedItem type="host">
        <titleInfo>
            <title>Transactions of the Association for Computational Linguistics</title>
        </titleInfo>
        <originInfo>
            <issuance>continuing</issuance>
            <publisher>MIT Press</publisher>
            <place>
                <placeTerm type="text">Cambridge, MA</placeTerm>
            </place>
        </originInfo>
        <genre authority="marcgt">periodical</genre>
        <genre authority="bibutilsgt">academic journal</genre>
    </relatedItem>
    <abstract>We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.</abstract>
    <identifier type="citekey">nguyen-etal-2023-generative</identifier>
    <identifier type="doi">10.1162/tacl_a_00545</identifier>
    <location>
        <url>https://aclanthology.org/2023.tacl-1.15/</url>
    </location>
    <part>
        <date>2023</date>
        <detail type="volume"><number>11</number></detail>
        <extent unit="page">
            <start>250</start>
            <end>266</end>
        </extent>
    </part>
</mods>
</modsCollection>

Endnote

%0 Journal Article
%T Generative Spoken Dialogue Language Modeling
%A Nguyen, Tu Anh
%A Kharitonov, Eugene
%A Copet, Jade
%A Adi, Yossi
%A Hsu, Wei-Ning
%A Elkahky, Ali
%A Tomasello, Paden
%A Algayres, Robin
%A Sagot, Benoît
%A Mohamed, Abdelrahman
%A Dupoux, Emmanuel
%J Transactions of the Association for Computational Linguistics
%D 2023
%V 11
%I MIT Press
%C Cambridge, MA
%F nguyen-etal-2023-generative
%X We introduce dGSLM, the first “textless” model able to generate audio samples of naturalistic spoken dialogues. It uses recent work on unsupervised spoken unit discovery coupled with a dual-tower transformer architecture with cross-attention trained on 2000 hours of two-channel raw conversational audio (Fisher dataset) without any text or labels. We show that our model is able to generate speech, laughter, and other paralinguistic signals in the two channels simultaneously and reproduces more naturalistic and fluid turn taking compared to a text-based cascaded model.
%R 10.1162/tacl_a_00545
%U https://aclanthology.org/2023.tacl-1.15/
%U https://doi.org/10.1162/tacl_a_00545
%P 250-266


Markdown (Informal)

[Generative Spoken Dialogue Language Modeling](https://aclanthology.org/2023.tacl-1.15/) (Nguyen et al., TACL 2023)
