
AliMeeting4MUG Dataset

Tags: text generation, text summarization, Chinese, action item extraction, General Meeting Understanding and Generation Challenge, topic title generation, keyphrase extraction, extractive summarization, topic segmentation


Sold: 0
Size: 16.52 MB

Data ID: D17406338510420139

Published: 2025/02/27

Data Description

Introduction

To fuel Spoken Language Processing (SLP) research on meetings and tackle its key challenges, the Speech Lab, the Language Technology Lab, and the ModelScope Community of Alibaba Group, together with the Alibaba Cloud Tianchi Platform and Zhejiang University, launched the General Meeting Understanding and Generation (MUG) challenge as an ICASSP 2023 Signal Processing Grand Challenge. The MUG challenge comprises five tracks: Track 1 Topic Segmentation (TS), Track 2 Topic-level and Session-level Extractive Summarization (ES), Track 3 Topic Title Generation (TTG), Track 4 Keyphrase Extraction (KPE), and Track 5 Action Item Detection (AID). We conducted a data privacy review of the AliMeeting4MUG Corpus to ensure that it contains no personally identifiable or private information.

To facilitate the MUG benchmark, we construct and release a meeting dataset, the AliMeeting4MUG Corpus (AMC), which consists of 654 recorded Mandarin meeting sessions with diverse topic coverage, along with manual annotations for SLP tasks on the manual transcripts of the meeting recordings. Compared to the existing meeting corpora supporting SLP tasks in Table 1, AMC is, to the best of our knowledge, the largest meeting corpus to date and supports the largest number of SLP tasks. We build baseline systems and report evaluation results on the MUG tasks. In the following, we introduce AMC, provide a data analysis, and describe the data loader.
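As a starting point, here is a minimal sketch of loading AMC through ModelScope's MsDataset API. The dataset name Alimeeting4MUG, the split name, and the record fields are assumptions made for illustration, not confirmed details of the release.

```python
# Minimal sketch: load AMC via ModelScope's MsDataset API.
# ASSUMPTIONS: the corpus is published on ModelScope under the name
# 'Alimeeting4MUG' with a 'train' split; the actual dataset name,
# namespace, and record schema may differ.
from modelscope.msdatasets import MsDataset

ds = MsDataset.load('Alimeeting4MUG', split='train')

# Inspect the raw fields of the first meeting session.
for record in ds:
    print(record)
    break
```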

Data Collection and Annotations

The AliMeeting4MUG Corpus (AMC) comprises 654 meetings: 524 meetings with all five SLP annotations (TS, ES, TTG, KPE, and AID) and 130 meetings with TS annotation only. Each meeting session consists of a 15- to 30-minute discussion among 2 to 4 participants covering certain topics. We create manual transcripts of the audio, with manually inserted punctuation and manual speaker labels, under careful quality control [5]. Next, we segment the documents into paragraphs using our text segmentation model [6]. Finally, we conduct manual annotations for the five SLP tasks (TS, ES, TTG, KPE, and AID).
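To make the annotation layout concrete, the sketch below shows one hypothetical in-memory representation of an annotated session, inferred from the description above. The field names and nesting are assumptions; the released files may use a different schema (for example, JSON lines).

```python
# Hypothetical record structure for one annotated AMC session.
# ASSUMPTIONS: all field names and the nesting are inferred from the
# prose above, not taken from the released data files.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Sentence:
    speaker: str                      # manual speaker label
    text: str                         # manual transcript with inserted punctuation

@dataclass
class Paragraph:
    sentences: List[Sentence]         # output of the text segmentation model
    is_topic_boundary: bool = False   # TS: a new topic starts at this paragraph
    topic_title: str = ""             # TTG: manually written title for the topic
    summary_indices: List[int] = field(default_factory=list)  # ES: indices of extracted sentences

@dataclass
class MeetingSession:
    session_id: str
    paragraphs: List[Paragraph]
    keyphrases: List[str] = field(default_factory=list)       # KPE annotations
    action_items: List[str] = field(default_factory=list)     # AID: sentences marked as action items
```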

For Tracks 2 to 5, we randomly partition the 524 meeting sessions with manual annotations for all five SLP tasks into 295 sessions for training (Train), 65 sessions for system development (Dev), 82 sessions (exceptTS-Test1) as First Stage Evaluation Data for regular updates of the leaderboard during the challenge, and 82 sessions (exceptTS-Test2) as Final Evaluation Data for the final leaderboard ranking. For Track 1 Topic Segmentation, we randomly partition the 130 meeting sessions with TS annotation only into 65 sessions (TS-Test1) as First Stage Evaluation Data for regular leaderboard updates of the TS track and 65 sessions (TS-Test2) as Final Evaluation Data for the final TS leaderboard ranking.
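For illustration only, a fixed-seed shuffle reproduces the shape of this partitioning, as sketched below; the session ids, the seed, and the helper function are hypothetical, since the organizers' actual randomization is not published.

```python
# Illustrative reconstruction of the random partitioning described above:
# 295/65/82/82 for the fully annotated sessions and 65/65 for the
# TS-only sessions. ASSUMPTIONS: session ids and the seed are made up;
# the organizers' actual shuffle is unknown.
import random

def partition(ids, sizes, seed=0):
    """Shuffle session ids deterministically and slice them into consecutive splits."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)
    splits, start = [], 0
    for size in sizes:
        splits.append(ids[start:start + size])
        start += size
    return splits

full_ids = [f"session_{i:03d}" for i in range(524)]   # all five SLP annotations
ts_ids = [f"ts_session_{i:03d}" for i in range(130)]  # TS annotation only

train, dev, test1, test2 = partition(full_ids, [295, 65, 82, 82])
ts_test1, ts_test2 = partition(ts_ids, [65, 65])
```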
