
Bible verses - 30 languages, IPA annotated

Arts and Entertainment, Education, Linguistics, Religion and Belief Systems

￥1

11.06MB

Data ID: D17169766906975403

Published: 2024/05/29

About Dataset

Original scripts can be found here:

https://github.com/mrinalmanu/bible_corpus_tools_in_python/blob/master/README.md
Author: Mrinal Vashisth; mrinalmanu10@gmail.com

It's my personal fun project :))

This is a set of functions for preparing text for language processing. The input is the set of XML files from the Open Bible Data corpus described in the paper "A massively parallel corpus: the Bible in 100 languages" by Christos Christodoulopoulos and Mark Steedman.
(https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4551210/)

Get data here:
https://github.com/christos-c/bible-corpus

Description:

bcp.py contains functions to convert the XML files into text files or CSV files.
For language processing the CSV files are more informative. They load into a pandas dataframe with the columns:

[verse_id] [text] [book] [name] [chapter] [verse_number]
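As a rough illustration of the XML-to-CSV step, here is a minimal sketch using only the standard library. It is not the code from bcp.py. It assumes that verses are stored as `seg` elements with `type="verse"` and ids shaped like `b.GEN.1.1`; the sample XML and the function names `verses_to_rows` and `rows_to_csv` are made up for the example. The `name` column is left out because this description does not say what it holds.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny invented stand-in for one corpus file; real files are much larger.
SAMPLE_XML = """<cesDoc>
  <text>
    <body>
      <div id="b.GEN" type="book">
        <div id="b.GEN.1" type="chapter">
          <seg id="b.GEN.1.1" type="verse">In the beginning God created the heaven and the earth.</seg>
          <seg id="b.GEN.1.2" type="verse">And the earth was without form, and void.</seg>
        </div>
      </div>
    </body>
  </text>
</cesDoc>"""

FIELDS = ["verse_id", "text", "book", "chapter", "verse_number"]


def verses_to_rows(xml_text):
    """Return one dict per verse element found in the XML string."""
    root = ET.fromstring(xml_text)
    rows = []
    for seg in root.iter("seg"):
        if seg.get("type") != "verse":
            continue
        verse_id = seg.get("id")
        # Assumed id layout: b.<BOOK>.<chapter>.<verse>
        _, book, chapter, verse_number = verse_id.split(".")
        rows.append({
            "verse_id": verse_id,
            "text": "".join(seg.itertext()).strip(),
            "book": book,
            "chapter": int(chapter),
            "verse_number": int(verse_number),
        })
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = verses_to_rows(SAMPLE_XML)
csv_text = rows_to_csv(rows)
```

The resulting CSV text can be written to disk and then loaded with pandas for the later steps.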

The next step is to take the CSV and trim the data while losing as little information as possible, so that the text of the New Testament (NT) is ready for analysis.

Final update:

I took the data and tried to find out which verses are shared across the most languages.
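One way to find the most widely shared verses is to count, for each verse id, the number of languages that contain it. The sketch below assumes the verse ids per language are already available as a dictionary; the function names and the toy data are hypothetical and not taken from the original scripts.

```python
from collections import Counter


def rank_shared_verses(verses_by_language):
    """Count, for each verse id, how many languages contain it."""
    counts = Counter()
    for verse_ids in verses_by_language.values():
        counts.update(set(verse_ids))  # set() so duplicates count once
    # Most widely shared first; ties broken by verse id for stable output.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def shared_by_at_least(verses_by_language, min_languages):
    """Verse ids that appear in at least min_languages languages."""
    ranking = rank_shared_verses(verses_by_language)
    return [verse_id for verse_id, n in ranking if n >= min_languages]


# Invented toy data: three languages with partly overlapping verses.
toy = {
    "eng": ["b.MAT.1.1", "b.MAT.1.2", "b.MAT.1.3"],
    "spa": ["b.MAT.1.1", "b.MAT.1.2"],
    "fin": ["b.MAT.1.1", "b.MAT.1.3", "b.MAT.1.3"],
}
ranking = rank_shared_verses(toy)
common = shared_by_at_least(toy, 3)
```

Choosing a threshold such as `min_languages` trades coverage per language against the number of verses kept.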

Then I used the epitran package to convert these verses into IPA.

Using multiprocessing and the dreaded row-wise operation on a pandas dataframe, it took 4 hours on 7 CPUs to process about 121,000 lines.
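The parallel step might look roughly like the sketch below, which maps a transliteration function over a list of verse texts with `multiprocessing.Pool`. This is an assumption about the general shape, not the original code. `to_ipa` is a hypothetical stand-in so the example runs without extra packages; with epitran installed, a call such as `epitran.Epitran("spa-Latn").transliterate(text)` would take its place.

```python
from multiprocessing import Pool


def to_ipa(text):
    """Hypothetical stand-in for a real transliterator.

    With epitran installed, something like
    epitran.Epitran("spa-Latn").transliterate(text) would go here.
    """
    return text.lower().replace("ch", "tʃ")


def annotate(texts, workers=1, chunksize=1000):
    """Apply to_ipa to every text, optionally across several processes."""
    if workers <= 1:
        return [to_ipa(t) for t in texts]
    with Pool(processes=workers) as pool:
        # map() preserves input order, so results line up with the rows.
        return pool.map(to_ipa, texts, chunksize=chunksize)


# Serial run on two invented sample lines.
sample = annotate(["Mucho gusto", "Noche buena"])
```

For a parallel run, call `annotate(texts, workers=7)` from inside an `if __name__ == "__main__":` guard, which multiprocessing needs on platforms that start workers by spawning a fresh interpreter. Passing a plain list of strings instead of dataframe rows keeps the work per item small and easy to pickle.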

The final database contains 121,000 lines from 30 languages.

I may figure out a way of using this dataframe in the future.

The dataframe annotation_features.csv includes details about the phonemes and their annotation.

I wanted to do some kind of analysis, for example grouping languages into families according to the phonetic information in well-translated phrases. Given that each translation is exact, the phonemes alone should reflect something about how the languages are related. Maybe I'll continue this in the future. In the meantime, this intermediate result can be used with appropriate credit to the original authors:

**A massively parallel corpus: the Bible in 100 languages**, Christos Christodoulopoulos and Mark Steedman.

And me, of course, if you are using this version. No worries if you are using it for study purposes. We are all students. :))

I used epitran for these tools, so please cite it as well.
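The language-grouping idea mentioned in this description could start from something as simple as comparing symbol frequencies between languages. The sketch below is only one possible starting point and is not part of the dataset. It treats every non-space character of an IPA line as one symbol, which is a crude approximation of phonemes, and the function names and sample strings are invented for the example.

```python
import math
from collections import Counter


def symbol_profile(ipa_lines):
    """Relative frequency of each non-space character in the IPA lines."""
    counts = Counter(
        ch for line in ipa_lines for ch in line if not ch.isspace()
    )
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


def cosine_similarity(p, q):
    """Cosine similarity of two sparse frequency profiles."""
    dot = sum(value * q.get(key, 0.0) for key, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 0.0
    return dot / (norm_p * norm_q)


# Invented IPA-like strings, one short list per hypothetical language.
profile_a = symbol_profile(["pata kasa", "tapa"])
profile_b = symbol_profile(["bada gaza", "daba"])
similarity = cosine_similarity(profile_a, profile_b)
```

Pairwise similarities computed this way over the shared verses could then be fed into any clustering method to see whether known language families emerge.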
