
Bambara French Parallel dataset

africa, earth and nature, nlp, text generation, translation, bambara

21.81MB

Data ID: D17171431028163761

Published: 2024/05/31


Data Description

Introduction

Bambara, also called Bamanankan or Bamana, is widely used as a vehicular and commercial language in West Africa and is one of the national languages of Mali. It belongs to the Mande language family and, within it, to the Mandingo language group, the largest group by number of speakers. Besides Bambara, this group includes Dioula in Côte d’Ivoire and Burkina Faso, Mandinka in Senegal and Gambia, and Maninka in Guinea. According to Worlddata, Bambara is not an official language in any of these countries, but it is spoken as a mother tongue by a minority of the population. It is most widespread in Mali, where around 46% of citizens speak it. In total, about 15.0 million people worldwide speak Bambara as their mother tongue.

Overview

The Bambara-French Parallel Dataset is a comprehensive resource designed for a wide array of machine learning projects that require parallel text data, including but not limited to translation, text-to-text generation, and linguistic analysis. This dataset features a collection of 46,976 aligned sentences, making it an invaluable tool for researchers and developers working on language models, especially those focusing on Bambara and French language pairings.

Data Sources

The sentences in this dataset were compiled from the Corpus Bambara de Référence, which draws on a diverse array of sources including periodicals, books, short stories, blog posts, and select passages from religious texts such as the Bible and the Quran. The texts cover a broad range of topics, offering rich linguistic diversity for training and testing machine learning models.

Dataset Composition

The dataset is provided in multiple file formats to accommodate various research needs and preferences:

  • bambara-french-parallel.csv: This CSV file is the primary dataset format, created to ensure easy access and manipulation. The data is encoded in UTF-8, with quoting applied to handle special characters efficiently.

  • bambara-french-parallel.feather: For those who prefer a binary format, the Feather version of the dataset offers fast, efficient loading and saving, making it ideal for use in data science projects where speed is a priority.

  • bambara-french-parallel.json: The JSON version provides a structured format that's easily consumable by web applications and services, facilitating seamless integration with modern tech stacks.

  • text.bam: A plain text file containing all Bambara sentences, with one sentence per line. This file serves as a foundational resource for generating the parallel dataset.

  • text.fr: Similar to text.bam, this file contains all French sentences from the dataset, arranged one sentence per line, providing a straightforward resource for language processing tasks.

Source Repository

The source files for the dataset, including text.bam and text.fr, were obtained from the following GitHub repository: RobotsMali-AI Datasets. These files were instrumental in constructing the parallel dataset by meticulously mapping each line to its corresponding translation.
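
The line-by-line mapping described above can be sketched roughly as follows. This is a minimal illustration, not the script actually used to build the dataset: the function name `pair_lines` is hypothetical, and the sketch assumes that both files are UTF-8 encoded and that line i of text.bam corresponds to line i of text.fr.

```python
def read_sentences(path):
    """Read a one-sentence-per-line UTF-8 file into a list of strings."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def pair_lines(bam_path, fr_path):
    """Pair line i of the Bambara file with line i of the French file.

    Hypothetical helper: assumes the two files are already aligned, so a
    differing line count is treated as an error rather than repaired.
    """
    bambara = read_sentences(bam_path)
    french = read_sentences(fr_path)
    if len(bambara) != len(french):
        raise ValueError(
            f"line counts differ: {len(bambara)} Bambara vs {len(french)} French"
        )
    return list(zip(bambara, french))
```

Calling `pair_lines("text.bam", "text.fr")` would return a list of (Bambara, French) tuples. Failing loudly on a length mismatch is a deliberate choice here, since a single missing line would silently shift every later pair.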

Usage Recommendations

For optimal usage, we recommend utilizing the Feather or JSON versions of the dataset. These formats circumvent the complexities associated with quoting in the CSV file, ensuring a smoother data handling experience for researchers and developers.
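
For readers who still want to work with the CSV file, the quoting issue mentioned above can be illustrated with a small sketch. The column names `bambara` and `french` and the sample row are assumptions made up for illustration; the real header and contents of bambara-french-parallel.csv may differ.

```python
import csv
import io

# Hypothetical row: placeholder text, not taken from the dataset.
rows = [
    {"bambara": "placeholder, with a comma", "french": 'Il a dit "oui", puis non.'},
]

# Write the row with every field quoted, as a quoted CSV export would.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["bambara", "french"], quoting=csv.QUOTE_ALL)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Naive splitting on commas cuts through fields that contain commas.
naive_fields = csv_text.splitlines()[1].split(",")

# A real CSV parser honours the quoting and recovers the original row.
parsed = list(csv.DictReader(io.StringIO(csv_text)))
```

Here `naive_fields` ends up with more pieces than there are columns, while `parsed` matches the original rows. When reading the actual file with Python's `csv` module, opening it with `encoding="utf-8"` and `newline=""` is the usual way to let the parser handle quoted fields correctly.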
