
Bible verses - 30 languages, IPA annotated


11.06MB

Dataset ID: D17169766906975403

Published: 2024/05/29

Data description

About Dataset

Original scripts can be found here:

https://github.com/mrinalmanu/bible_corpus_tools_in_python/blob/master/README.md
Author: Mrinal Vashisth; mrinalmanu10@gmail.com

It's my personal fun project :))

This is a set of functions for preparing text for language processing from the XML files of the Open Bible Data corpus, described in the paper "A massively parallel corpus: the Bible in 100 languages" by Christos Christodoulopoulos and Mark Steedman.
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551210/)

Get data here:
https://github.com/christos-c/bible-corpus

Description:

bcp.py contains functions to convert the XML files into plain text files or CSV files, for whatever purpose.
For language processing the CSV files are more informative. They load as a pandas dataframe with the columns:

[verse_id] [text] [book] [name] [chapter] [verse_number]
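
As a rough illustration of the XML-to-CSV step (this is not the actual code in bcp.py), the sketch below assumes that each verse in the corpus files is a `seg` element with `type="verse"` and an id such as `b.GEN.1.1`. The function names `verses_to_rows` and `rows_to_csv` and the sample XML are hypothetical. Only the standard library is used; passing the rows to `pandas.DataFrame(rows)` would give a dataframe. The `name` column is left out because the description does not say how it is derived.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny sample in the ASSUMED layout: verse ids look like "b.<BOOK>.<chapter>.<verse>".
SAMPLE_XML = """<cesDoc>
  <text><body>
    <div id="b.GEN" type="book">
      <div id="b.GEN.1" type="chapter">
        <seg id="b.GEN.1.1" type="verse"> In the beginning God created the heaven and the earth. </seg>
        <seg id="b.GEN.1.2" type="verse"> And the earth was without form, and void. </seg>
      </div>
    </div>
  </body></text>
</cesDoc>"""

FIELDS = ["verse_id", "text", "book", "chapter", "verse_number"]


def verses_to_rows(xml_text):
    """Return one dict per verse: verse_id, text, book, chapter, verse_number."""
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iter("seg"):
        if seg.get("type") != "verse":
            continue
        verse_id = seg.get("id", "")
        parts = verse_id.split(".")
        if len(parts) != 4:
            continue  # skip ids that do not match the assumed pattern
        _, book, chapter, verse_number = parts
        rows.append({
            "verse_id": verse_id,
            "text": " ".join((seg.text or "").split()),  # collapse whitespace
            "book": book,
            "chapter": chapter,
            "verse_number": verse_number,
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = verses_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
```

Chapter and verse numbers are kept as strings here so that unusual ids do not raise a conversion error; they can be cast to integers later if the data allows it.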

The next step is to take the CSV, clean the data while losing as little information as possible, and get the text of the New Testament (NT) ready for analysis.
 
Final update:

I took the data and tried to find out which verses are shared across the most languages.
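
One simple way to find the most widely shared verses is to count, for each verse id, how many languages contain it. The sketch below is only an illustration of that idea and not necessarily how it was done for this dataset; the input dictionary and the function names `verse_coverage` and `shared_verses` are hypothetical.

```python
from collections import Counter

# Hypothetical input: for each language, the set of verse ids it contains.
verse_ids_by_language = {
    "English": {"b.MAT.1.1", "b.MAT.1.2", "b.MAT.1.3"},
    "German": {"b.MAT.1.1", "b.MAT.1.2"},
    "Turkish": {"b.MAT.1.1", "b.MAT.1.3"},
}


def verse_coverage(ids_by_language):
    """Count, for every verse id, how many languages contain it."""
    counts = Counter()
    for verse_ids in ids_by_language.values():
        counts.update(verse_ids)  # each set contributes 1 per verse id
    return counts


def shared_verses(ids_by_language, min_languages):
    """Return the sorted verse ids found in at least `min_languages` languages."""
    counts = verse_coverage(ids_by_language)
    return sorted(v for v, n in counts.items() if n >= min_languages)


coverage = verse_coverage(verse_ids_by_language)
in_all = shared_verses(verse_ids_by_language, min_languages=3)
in_most = shared_verses(verse_ids_by_language, min_languages=2)
```

Lowering `min_languages` trades strict parallelism for more verses, which is one way to keep as much text as possible.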

Then I took the epitran package and converted these verses into IPA annotation.
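
A minimal sketch of the IPA step, assuming epitran's documented `Epitran(code).transliterate(text)` interface with language codes such as `spa-Latn`. The row keys `lang_code` and `text`, the helper names, and the stand-in transliterator are hypothetical. Rows are grouped by language so that one transliterator is built per language instead of one per row.

```python
from collections import defaultdict


def load_epitran(code):
    """Build a transliteration function for an Epitran code such as 'spa-Latn'."""
    import epitran  # third-party; imported lazily so the rest runs without it

    return epitran.Epitran(code).transliterate


def annotate_ipa(rows, make_transliterator=load_epitran):
    """Add an 'ipa' field to each row, creating one transliterator per language.

    `rows` is a list of dicts with the hypothetical keys 'lang_code' and 'text'.
    """
    by_language = defaultdict(list)
    for row in rows:
        by_language[row["lang_code"]].append(row)

    annotated = []
    for code, group in by_language.items():
        transliterate = make_transliterator(code)  # built once per language
        for row in group:
            annotated.append({**row, "ipa": transliterate(row["text"])})
    return annotated


# Stand-in so the sketch runs without epitran installed:
# it only lowercases the text and does NOT produce real IPA.
def fake_transliterator(code):
    return lambda text: text.lower()


demo = annotate_ipa(
    [
        {"lang_code": "spa-Latn", "text": "En el principio"},
        {"lang_code": "deu-Latn", "text": "Am Anfang"},
    ],
    make_transliterator=fake_transliterator,
)
```

With epitran installed, calling `annotate_ipa(rows)` without the second argument would use the real transliterator.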

Using multiprocessing and the dreadful row-wise operation on a pandas dataframe, it took 4 hours with 7 CPUs to process about 121,000 lines.

The final database contains 121,000 lines from 30 languages.

I may in future figure out a way of using this dataframe.

The dataframe annotation_features.csv includes details about the phonemes and their annotation.
 
I wanted to do some kind of analysis, for example grouping languages into families according to the phonetic information in well-translated phrases. Given that each translation is exact, something about the phonemes alone should be reflected. Maybe I'll continue this in future. In the meantime, this intermediate result can be used with appropriate credit to the original authors:

**A massively parallel corpus: the Bible in 100 languages**, Christos Christodoulopoulos and Mark Steedman.

And me, of course, if you are using this version. No worries if using for study purposes. We are all students. :))

I used epitran for these tools, so please cite it as well.
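
The language-grouping idea mentioned above has not been carried out here, but one possible starting point is to compare languages by how often each IPA symbol occurs. The sketch below is an assumption about how that could look, not a result from this dataset: the toy strings, the language labels, and the function names are hypothetical, and treating every single character as a symbol is a crude simplification (diacritics and multi-character phonemes are not handled).

```python
import math
from collections import Counter

# Hypothetical toy input: IPA-like strings per language (not taken from the dataset).
ipa_by_language = {
    "lang_a": "papa tata",
    "lang_b": "papa tata papa",
    "lang_c": "ʃiʃi ʒuʒu",
}


def symbol_profile(ipa_text):
    """Relative frequency of each non-space symbol in an IPA string."""
    counts = Counter(ch for ch in ipa_text if not ch.isspace())
    total = sum(counts.values())
    return {symbol: n / total for symbol, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity between two sparse frequency profiles."""
    dot = sum(p[s] * q.get(s, 0.0) for s in p)
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


profiles = {lang: symbol_profile(text) for lang, text in ipa_by_language.items()}
sim_ab = cosine_similarity(profiles["lang_a"], profiles["lang_b"])
sim_ac = cosine_similarity(profiles["lang_a"], profiles["lang_c"])
```

Languages with similar symbol inventories score close to 1, and languages with no symbols in common score 0; a clustering step over these pairwise scores would be the next thing to try.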
