Converting Bible translation from MS Word/LibreOffice
Posted by: Mondele

The files referred to in this article can be downloaded here: Tarangan_James_Philemon

While we normally encourage translation work to be done in one of our tools (Autographa, translationStudio, vMAST), sometimes the work is already in process, or it is just better for the local team to do it in another program.

We received work that had been done in Microsoft Word. It had been formatted so that the verse numbers were superscripted (like 1 this). Consistent formatting like this makes the document regular, and therefore easier to convert.

I don’t have MS Word, so I opened the document in LibreOffice. The first thing I did was go to File… and choose Export… The format I chose was XHTML (.html;.xhtml). This made a copy of the file with an extension of .html.

Now I opened the html file in a text editor. I used Bluefish, a cross-platform free open-source HTML editor.

At the top of the file was a lot of formatting information:

 

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!--This file was converted to xhtml by LibreOffice - see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/xslt for the code.-->
<head profile="http://dublincore.org/documents/dcmi-terms/">
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
<title xml:lang="en-US">
- no title specified</title>
<meta name="DCTERMS.title" content="" xml:lang="en-US"/>
<meta name="DCTERMS.language" content="en-US" scheme="DCTERMS.RFC4646"/>
<meta name="DCTERMS.source" content="http://xml.openoffice.org/odf2xhtml"/>
<meta name="DCTERMS.creator" content="lifestyle"/>
<meta name="DCTERMS.issued" content="2019-02-21T11:42:00" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.contributor" content="user"/>
<meta name="DCTERMS.modified" content="2019-02-21T11:45:00" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.provenance" content="" xml:lang="en-US"/>
<meta name="DCTERMS.subject" content="," xml:lang="en-US"/>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" hreflang="en"/>
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" hreflang="en"/>
<link rel="schema.DCTYPE" href="http://purl.org/dc/dcmitype/" hreflang="en"/>
<link rel="schema.DCAM" href="http://purl.org/dc/dcam/" hreflang="en"/>
<style type="text/css">

@page { }
table { border-collapse:collapse; border-spacing:0; empty-cells:show }
td, th { vertical-align:top; font-size:12pt;}
h1, h2, h3, h4, h5, h6 { clear:both;}
ol, ul { margin:0; padding:0;}
li { list-style: none; margin:0; padding:0;}
/* "li span.odfLiEnd" - IE 7 issue*/
li span. { clear: both; line-height:0; width:0; height:0; margin:0; padding:0; }
span.footnodeNumber { padding-right:1em; }
span.annotation_style_by_filter { font-size:95%; font-family:Arial; background-color:#fff000; margin:0; border:0; padding:0; }
span.heading_numbering { margin-right: 0.8rem; }* { margin:0;}
.gr1 { border-width:0.0133cm; border-style:solid; border-color:#000000; font-size:11pt; margin-bottom:0.0146in; margin-left:0.1252in; margin-right:0.1402in; margin-top:0in; padding:0.0591in; font-family:Calibri; vertical-align:top; min-height:0in;min-width:0in;padding-top:0.05in; padding-bottom:0.05in; padding-left:0.1in; padding-right:0.1in; }
.gr2 { border-width:0.0133cm; border-style:solid; border-color:#000000; font-size:11pt; margin-bottom:0.0028in; margin-left:0.1252in; margin-right:0.1252in; margin-top:0in; padding:0.0591in; font-family:Calibri; vertical-align:top; min-height:0.2728in;min-width:0.1409in;padding-top:0.05in; padding-bottom:0.05in; padding-left:0.1in; padding-right:0.1in; }
.Footer { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.Header { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.P1 { color:#00000a; font-size:11pt; text-align:right ! important; font-family:Calibri; writing-mode:lr-tb; line-height:200%; }
.P10 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; }
.P11 { font-size:18pt; font-family:Calibri; writing-mode:page; text-align:left ! important; }
.P12 { font-size:18pt; font-family:Calibri; writing-mode:page; text-align:left ! important; }
.P2 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.P3 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.P4 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.P5 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; font-weight:bold; }
.P6 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; font-weight:bold; }
.P7 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.Standard { font-size:11pt; font-family:Calibri; writing-mode:lr-tb; text-align:left ! important; color:#00000a; }
.T1 { font-style:italic; }
.T2 { font-style:italic; }
.T3 { font-size:16pt; font-weight:bold; }
.T5 { font-weight:bold; }
.T6 { font-size:18pt; font-weight:bold; }
.T7 { vertical-align:super; font-size:58%;}
.T8 { vertical-align:super; font-size:58%;font-weight:bold; }
/* ODF styles with no properties representable as CSS */
.Sect1 .T4 { }
</style>

The only parts of this that are important to me are .T7 and .T8, which both have vertical-align:super; set. Text in these styles will be superscripted, and is probably a verse number or a footnote.
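If the style block is long, the superscript classes can also be picked out with a short script. This is only a rough sketch in Python, not part of the original workflow; the function name superscript_classes and the sample CSS are made up for illustration, and it assumes the simple one-rule-per-class layout that LibreOffice produced here.

```python
import re

def superscript_classes(css: str) -> list[str]:
    """Return CSS class names whose rule sets vertical-align to super."""
    # Each simple rule looks like: .T7 { vertical-align:super; font-size:58%;}
    rule = re.compile(r"\.([A-Za-z_][\w-]*)\s*\{([^}]*)\}")
    found = []
    for name, body in rule.findall(css):
        if re.search(r"vertical-align\s*:\s*super", body):
            found.append(name)
    return found

# A few lines copied from the exported style block.
sample_css = """
.T6 { font-size:18pt; font-weight:bold; }
.T7 { vertical-align:super; font-size:58%;}
.T8 { vertical-align:super; font-size:58%;font-weight:bold; }
"""
print(superscript_classes(sample_css))
```

For the three sample rules this reports T7 and T8, the same two classes spotted by eye above.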

Sure enough, when you look down (below the <body… tag) you can see <span class="T7">3</span>. (Verse 1 was skipped, using just the chapter number, and verse 2 had been missed, just being in the body.)

We’re ready to start cleaning up the text.

First, we remove everything before the beginning of the text, which in this case is Salam Aban yudas. Then, we put in \c 1 for the first chapter, and \v 1 on a new line for the first verse, followed by a space and then the text of verse 1. I found the number 2 for verse 2 and did the same thing.

Now, most of the rest can be done automatically: I go to the Edit menu and choose Advanced Find and Replace (other programs may call this something different; for example, in Notepad++ for Windows, it’s under the Search menu, Replace…). Using a Regular expression so that we can clean it in one step, we search for

<span class="T7">(\d+)</span>

and we replace it with

\n\\v \1

Let’s explain this piece by piece. <span class="T7"> should make sense: that’s the class we saw in the formatting information at the beginning. Everything up to the next </span> tag will be superscripted, and should be a verse number.

The parentheses are to “capture” what matches inside them. That’s so that we don’t lose the verse number. \d is a regular expression that means “a digit”, or a number from 0-9. The + that follows tells us to match one or more of what comes before it. So, \d+ means match one or more digits. In some programs we also need to add ?, meaning “don’t take more than you need”. So, it would be \d+? inside the parentheses.

For the replacement, \n means “start a new line”. In USFM every verse needs to be on its own line. Then, we say \\v because we want to get \v. With regular expressions, the backslash \ is a special character (remember \d?) so if we actually want a \ we have to double it. \1 means “insert whatever the first pair of parentheses captured”. In other words, \1 will be replaced with our verse number. For the first verse in this file, that’s verse 3, but it works the same way for all of them.

The final thing to notice is that there’s a space after the \1 in the replacement phrase. It’s important to have a space between the verse number and the verse text.

So, for the first verse with a superscripted number in this file, we have <span class="T7">3</span>gwel jak ago being turned into

\v 3 gwel jak ago

(See how it’s on its own line?)
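The same find-and-replace can be tried outside an editor. The sketch below uses Python's re module, which is my choice for illustration and not something the original workflow used; the function name spans_to_usfm is hypothetical. It assumes the verse numbers are in T7 spans, as in this file.

```python
import re

# Pattern from the article: a T7 span holding only digits.
VERSE_SPAN = re.compile(r'<span class="T7">(\d+)</span>')

def spans_to_usfm(html: str) -> str:
    """Turn each superscripted verse number into a USFM verse marker on a new line."""
    # In the raw replacement string, \n is a newline, \\ is one backslash,
    # and \1 is the captured verse number; the trailing space separates
    # the number from the verse text.
    return VERSE_SPAN.sub(r"\n\\v \1 ", html)

print(spans_to_usfm('<span class="T7">3</span>gwel jak ago'))
```

If your file uses a different class for verse numbers, change T7 in the pattern to match.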

If you want to do all of this editing in LibreOffice, you may need to change the file extension of the HTML to .txt to see the HTML codes.

With the sample files, we have a couple of other things to look at. First, the “front matter” is missing, so no one will know what book this is. Full documentation about USFM can be found here: http://ubsicap.github.io/usfm/identification/index.html, or you can look at a project for another language that has been saved from translationStudio or Autographa.

For Jude, the first lines should be:

\id jud Regular
\ide usfm
\h Jude
\toc1 Jude
\toc2 Jude
\toc3 jud
\mt Jude

Let’s look at this line-by-line.

\id jud Regular tells programs that this is the book of Jude, and that it’s an OL translation. It could also say \id jud ULB, or \id jud Tarangan.
\ide usfm tells programs that this is usfm, so they can decode it properly.
\h Jude is running header information. In this case, I would actually recommend using \h Yudas.
\toc1 is for the long form of the book name. In English, for example, we might put \toc1 The Epistle of James or \toc1 The Letter from James.
\toc2 is for the shorter form of the book name. \toc2 Yudas would be fine.
\toc3 is for an abbreviated name of the book. This is useful if you use a short form (Jhn 3:16) notation.
\mt is the title of the book as it’s printed at the top of the first page of the book. If you want to use multiple lines, you can use \mt1, \mt2, etc. In this case it should probably be \mt Salam Aban yudas.

Important note here: don’t change the book abbreviation in the first line: \id jud. This is the identifier for programs, and is based on English. All of the other places the book name appears, you can feel free to change it to the local name.
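If you are preparing several books, the front matter can be generated and not typed by hand. This is a minimal sketch, assuming the seven lines listed above are all you need; the function front_matter and its parameter names are made up for this example, and the values passed in are just the ones suggested in this article (the \toc1 value is a placeholder).

```python
def front_matter(book_id: str, resource: str, header: str,
                 long_name: str, short_name: str, abbrev: str,
                 title: str) -> str:
    """Build the USFM identification lines in the order used in this article."""
    lines = [
        f"\\id {book_id} {resource}",  # book identifier: keep the English-based code
        "\\ide usfm",                  # encoding line, as shown in the article
        f"\\h {header}",               # running header
        f"\\toc1 {long_name}",         # long form of the book name
        f"\\toc2 {short_name}",        # shorter form of the book name
        f"\\toc3 {abbrev}",            # abbreviated name
        f"\\mt {title}",               # title printed at the top of the book
    ]
    return "\n".join(lines) + "\n"

print(front_matter("jud", "Regular", "Yudas", "Jude", "Yudas", "jud",
                   "Salam Aban yudas"))
```

Only the book_id should stay in its English-based form; every other argument can be the local name.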

The file should be saved with the language code _ book code _ resource type _ project type. In this case, tre_jud_text_reg.usfm. (Please understand that I don’t know which Tarangan language this book is in, so I chose one of the language codes. Use the correct language code.)
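The naming pattern is easy to get wrong when there are several books, so here is a tiny sketch that builds the name from its parts. The function usfm_filename and its default values are my own illustration, based only on the example name given above.

```python
def usfm_filename(language_code: str, book_code: str,
                  resource_type: str = "text",
                  project_type: str = "reg") -> str:
    """Join the four parts with underscores and add the .usfm extension."""
    return f"{language_code}_{book_code}_{resource_type}_{project_type}.usfm"

print(usfm_filename("tre", "jud"))
```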

When there are additional lines in the translation for section headings, USFM deals with these in a special way. (These are not part of the translation, as they are not from the original Bible texts — they are just to help people understand what they are about to read.) In this file, we have Allah On Aukum Dir-Dir Ago Daisago Sala. This should be on a line by itself with a \s tag and a space to show that it is a section name:

\s Allah On Aukum Dir-Dir Ago Daisago Sala

Finally, this file contains two books of the Bible: Jude (the book we have been working on) and Philemon. Each book needs to go into its own file. Follow the same directions for Philemon that we have followed for Jude.

Make sure you check for verses that weren’t formatted correctly: on the first run through we were missing verses 1, 2, 5, 13, and 16. Verse 16 was missing altogether.
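A quick way to find the gaps is to list which verse numbers never appear as a \v marker. The sketch below is one possible check, not a tool we ship; missing_verses is a hypothetical name, and you have to supply the number of verses in the chapter yourself (Jude has 25).

```python
import re

def missing_verses(usfm: str, last_verse: int) -> list[int]:
    """List verse numbers from 1 to last_verse that have no verse marker line."""
    # Verse markers sit at the start of a line: a backslash, v, a space, digits.
    found = {int(n) for n in re.findall(r"^\\v (\d+)", usfm, flags=re.MULTILINE)}
    return [n for n in range(1, last_verse + 1) if n not in found]

# Made-up sample text with verses 3 and 5 missing.
sample = "\\c 1\n\\v 1 aaa\n\\v 2 bbb\n\\v 4 ddd\n"
print(missing_verses(sample, 5))
```

This only works for single-chapter books like the two in this file; a longer book would need the check run per chapter.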

 
