language and languages « Full Text Reports…

Dating the Origin of Language Using Phonemic Diversity

June 18, 2012

Source: PLoS ONE

Language is a key adaptation of our species, yet we do not know when it evolved. Here, we use data on language phonemic diversity to estimate a minimum date for the origin of language. We take advantage of the fact that phonemic diversity evolves slowly and use it as a clock to calculate how long the oldest African languages would have to have been around in order to accumulate the number of phonemes they possess today. We use a natural experiment, the colonization of Southeast Asia and Andaman Islands, to estimate the rate at which phonemic diversity increases through time. Using this rate, we estimate that present-day languages date back to the Middle Stone Age in Africa. Our analysis is consistent with the archaeological evidence suggesting that complex human behavior evolved during the Middle Stone Age in Africa, and does not support the view that language is a recent adaptation that has sparked the dispersal of humans out of Africa. While some of our assumptions require testing and our results rely at present on a single case-study, our analysis constitutes the first estimate of when language evolved that is directly based on linguistic data.

Categories: language and languages, PLoS ONE

A Comparative Study of PDF Generation Methods: Measuring Loss of Fidelity When Converting Arabic and Persian MS Word Files to PDF

June 9, 2011

Source: Mitre Corporation

Converting files to Portable Document Format (PDF) is popular due to the format’s many advantages. For example, PDF allows an author to control or preserve the rendering of a digital document, distribute it to other systems, and ensure that it displays in a viewer as intended.

From the perspective of Human Language Technology (HLT), however, PDFs are problematic. PDF is a display-oriented digital document format; the point of PDF is to preserve the appearance of a document, not to preserve the original electronic text. We observed errors in PDF-extracted text indicating that the PDF generator, the extractor, or both had mishandled the document structure, character data, or entire textual objects. We also learned that other HLT researchers had reported data loss when extracting electronic text from PDFs. This motivated further study of digital document data exchange using PDFs.

MITRE conducted an exploratory study of data exchange using PDF to investigate the data loss phenomenon. We limited our study to Middle Eastern electronic text, specifically Arabic and Persian. The study included a test for scoring PDF generation methods that (a) used a common, best-practice setup to generate PDFs and extract text, and (b) used character accuracy to quantify the quality of PDF-extracted text. We ranked eight methods according to the resulting accuracy scores. The eight methods map to three core PDF generation classes. At best, the Microsoft Word class resulted in 42% Overall Accuracy. Best scores for the PDFMaker and Acrobat Distiller/PScript5.dll classes were 95% and 96%, respectively.
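The abstract does not say how character accuracy was computed. As a rough illustration only, the sketch below assumes one common definition: one minus the Levenshtein edit distance between the reference text and the extracted text, divided by the reference length. The function names `edit_distance` and `character_accuracy` are hypothetical and do not come from the MITRE paper.

```python
def edit_distance(reference: str, extracted: str) -> int:
    """Levenshtein distance, counted over Unicode code points."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of `extracted`.
    previous = list(range(len(extracted) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, ext_char in enumerate(extracted, start=1):
            cost = 0 if ref_char == ext_char else 1
            current.append(min(
                previous[j] + 1,         # character missing from extracted text
                current[j - 1] + 1,      # spurious character in extracted text
                previous[j - 1] + cost,  # match or substitution
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, extracted: str) -> float:
    """Assumed definition: (reference length - errors) / reference length."""
    if not reference:
        return 1.0 if not extracted else 0.0
    errors = edit_distance(reference, extracted)
    # Clamp at zero so very noisy extractions do not go negative.
    return max(0.0, (len(reference) - errors) / len(reference))


# A spurious space inside a four-letter Arabic token counts as one error.
score = character_accuracy("سلام", "س لام")
```

Under this assumed definition, a single spurious space inside a four-character token gives a score of 0.75. The study's own scoring may differ, for example in how it normalizes Unicode or weights error types.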

This paper explains our tests and discusses the results, including evidence that using PDF for data exchange of typical Arabic and Persian documents results in a loss of important electronic text content. This loss confuses human language technologies such as search engines, machine translation engines, computer-assisted translation tools, named entity recognizers, and information extractors.

Furthermore, most of the spurious newlines, spurious spaces in tokens, spurious character substitutions, and entity errors observed in the study were due to the PDF generation method, rather than the PDF text extractor. So, using a common configuration to convert reliable electronic text to PDF for data exchange causes irretrievable loss of electronic text on the receiving end.

Categories: language and languages, Mitre Corporation, technology and internet

About FullTextReports

FullTextReports is compiled and edited by Gary Price and Shirl Kennedy. The site is free to access.

Before launching FullTextReports, Price and Kennedy were senior editors at ResourceShelf (ResourceBlog) and DocuTicker (DocuBase) for more than 10 years.
This website is updated as often as possible during the week and at least once a day on the weekends.
The sister site to FullTextReports, INFOdocket, offers information industry news, useful websites, search tips and tools...and occasional commentary.
FullTextReports is not DocuTicker (or DocuBase) and INFOdocket is not ResourceShelf (or ResourceBlog). Gary and Shirl are no longer contributors to either of those sites.
You can contact Shirl and Gary at: FullTextReports@gmail.com or INFOdocket@gmail.com.
