Converting Bible translation from MS Word/LibreOffice

The files referred to in this article can be downloaded here: Tarangan_James_Philemon

While we normally encourage translation work to be done in one of our tools (Autographa, translationStudio, vMAST), sometimes the work is already in progress, or it is simply better for the local team to do it in another program.

We received work that had been done in Microsoft Word. The verse numbers had been formatted as superscript (like ¹this). Consistent formatting like this makes the document regular, and therefore easier to convert.

I don’t have MS Word, so I opened the document in LibreOffice. The first thing I did was go to File… and choose Export… The format I chose was XHTML (.html;.xhtml). This made a copy of the file with an extension of .html.

Next, I opened the .html file in a text editor. I used Bluefish, a free, open-source, cross-platform HTML editor.

At the top of the file was a lot of formatting information:

 

<?xml version="1.0" encoding="UTF-8"?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1 plus MathML 2.0//EN" "http://www.w3.org/Math/DTD/mathml2/xhtml-math11-f.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<!--This file was converted to xhtml by LibreOffice - see http://cgit.freedesktop.org/libreoffice/core/tree/filter/source/xslt for the code.-->
<head profile="http://dublincore.org/documents/dcmi-terms/">
<meta http-equiv="Content-Type" content="application/xhtml+xml; charset=utf-8"/>
<title xml:lang="en-US">
- no title specified</title>
<meta name="DCTERMS.title" content="" xml:lang="en-US"/>
<meta name="DCTERMS.language" content="en-US" scheme="DCTERMS.RFC4646"/>
<meta name="DCTERMS.source" content="http://xml.openoffice.org/odf2xhtml"/>
<meta name="DCTERMS.creator" content="lifestyle"/>
<meta name="DCTERMS.issued" content="2019-02-21T11:42:00" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.contributor" content="user"/>
<meta name="DCTERMS.modified" content="2019-02-21T11:45:00" scheme="DCTERMS.W3CDTF"/>
<meta name="DCTERMS.provenance" content="" xml:lang="en-US"/>
<meta name="DCTERMS.subject" content="," xml:lang="en-US"/>
<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" hreflang="en"/>
<link rel="schema.DCTERMS" href="http://purl.org/dc/terms/" hreflang="en"/>
<link rel="schema.DCTYPE" href="http://purl.org/dc/dcmitype/" hreflang="en"/>
<link rel="schema.DCAM" href="http://purl.org/dc/dcam/" hreflang="en"/>
<style type="text/css">

@page { }
table { border-collapse:collapse; border-spacing:0; empty-cells:show }
td, th { vertical-align:top; font-size:12pt;}
h1, h2, h3, h4, h5, h6 { clear:both;}
ol, ul { margin:0; padding:0;}
li { list-style: none; margin:0; padding:0;}
/* "li span.odfLiEnd" - IE 7 issue*/
li span. { clear: both; line-height:0; width:0; height:0; margin:0; padding:0; }
span.footnodeNumber { padding-right:1em; }
span.annotation_style_by_filter { font-size:95%; font-family:Arial; background-color:#fff000; margin:0; border:0; padding:0; }
span.heading_numbering { margin-right: 0.8rem; }* { margin:0;}
.gr1 { border-width:0.0133cm; border-style:solid; border-color:#000000; font-size:11pt; margin-bottom:0.0146in; margin-left:0.1252in; margin-right:0.1402in; margin-top:0in; padding:0.0591in; font-family:Calibri; vertical-align:top; min-height:0in;min-width:0in;padding-top:0.05in; padding-bottom:0.05in; padding-left:0.1in; padding-right:0.1in; }
.gr2 { border-width:0.0133cm; border-style:solid; border-color:#000000; font-size:11pt; margin-bottom:0.0028in; margin-left:0.1252in; margin-right:0.1252in; margin-top:0in; padding:0.0591in; font-family:Calibri; vertical-align:top; min-height:0.2728in;min-width:0.1409in;padding-top:0.05in; padding-bottom:0.05in; padding-left:0.1in; padding-right:0.1in; }
.Footer { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.Header { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.P1 { color:#00000a; font-size:11pt; text-align:right ! important; font-family:Calibri; writing-mode:lr-tb; line-height:200%; }
.P10 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; }
.P11 { font-size:18pt; font-family:Calibri; writing-mode:page; text-align:left ! important; }
.P12 { font-size:18pt; font-family:Calibri; writing-mode:page; text-align:left ! important; }
.P2 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; }
.P3 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.P4 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.P5 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; font-weight:bold; }
.P6 { color:#00000a; font-size:11pt; text-align:center ! important; font-family:Calibri; writing-mode:lr-tb; font-weight:bold; }
.P7 { color:#00000a; font-size:11pt; text-align:left ! important; font-family:Calibri; writing-mode:lr-tb; line-height:150%; }
.Standard { font-size:11pt; font-family:Calibri; writing-mode:lr-tb; text-align:left ! important; color:#00000a; }
.T1 { font-style:italic; }
.T2 { font-style:italic; }
.T3 { font-size:16pt; font-weight:bold; }
.T5 { font-weight:bold; }
.T6 { font-size:18pt; font-weight:bold; }
.T7 { vertical-align:super; font-size:58%;}
.T8 { vertical-align:super; font-size:58%;font-weight:bold; }
/* ODF styles with no properties representable as CSS */
.Sect1 .T4 { }
</style>

The only parts of this that matter to me are the .T7 and .T8 rules, which both contain vertical-align:super;. Text with these classes will be superscripted, so it is probably a verse number or a footnote marker.

Sure enough, when you look further down (below the <body… tag) you can see <span class="T7">3</span>. (Verse 1 had no number of its own, only the chapter number, and the number for verse 2 had not been superscripted, so it was sitting in the body text as a plain 2.)
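If the style block is long, a short script can pick out the superscript classes for you. This is only a sketch in Python, not part of the original workflow. The function name superscript_classes is made up for illustration, and it assumes simple, un-nested CSS rules like the ones LibreOffice exported above.

```python
import re

def superscript_classes(css):
    """Return the class names whose CSS rule contains vertical-align:super."""
    names = []
    # Each simple rule looks like: .T7 { vertical-align:super; font-size:58%;}
    for selector, body in re.findall(r"([^{}]+)\{([^{}]*)\}", css):
        if re.search(r"vertical-align\s*:\s*super", body):
            # Pull the class names out of the selector, e.g. ".T7" gives "T7"
            names.extend(re.findall(r"\.([A-Za-z_][\w-]*)", selector))
    return names

# Three rules copied from the exported style block shown above
sample_css = """
.T6 { font-size:18pt; font-weight:bold; }
.T7 { vertical-align:super; font-size:58%;}
.T8 { vertical-align:super; font-size:58%;font-weight:bold; }
"""

print(superscript_classes(sample_css))  # ['T7', 'T8']
```

The class names will differ from one exported file to the next, so check them each time instead of assuming T7.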

We’re ready to start cleaning up the text.

First, we remove everything before the beginning of the text. In this case, the text begins with Salam Aban yudas. Then, we put in \c 1 for the first chapter, and \v 1 on a new line for the first verse, followed by a space and then the text of verse 1. I found the plain number 2 for verse 2 and replaced it the same way, with \v 2 on a new line.

Now, most of the rest can be done automatically: I go to the Edit menu and choose Advanced Find and Replace (other programs may call this something different; for example, in Notepad++ for Windows, it’s under the Search menu, Replace…). Using a Regular expression so that we can clean it in one step, we search for

<span class="T7">(\d+)</span>

and we replace it with

\n\\v \1

Let’s explain this piece by piece. <span class="T7"> should make sense: that’s the class we saw in the formatting information at the beginning. Everything up to the next </span> tag will be superscripted, and should be a verse number.

The parentheses are there to “capture” whatever matches inside them, so that we don’t lose the verse number. \d is a regular-expression token that means “a digit”, that is, any character from 0 to 9. The + that follows tells us to match one or more of what comes before it. So, \d+ means match one or more digits. In some programs we also need to add ?, meaning “don’t take more than you need”. So, it would be \d+? inside the parentheses.

For the replacement, \n means “start a new line”, because we want each verse on its own line in the USFM file. Then, we say \\v because we want to get \v. With regular expressions, the backslash \ is a special character (remember \d?), so if we actually want a \ we have to double it. \1 means “insert whatever the first pair of parentheses captured”. In other words, \1 puts our verse number back. For the first superscripted verse in this file, that’s 3, but it works the same way for every verse number that is matched.

The final thing to notice is that there’s a space after the \1 in the replacement phrase. It is easy to miss because it is the last character, but it’s important to have a space between the verse number and the verse text.

So, for the first verse with a superscripted number in this file, we have <span class="T7">3</span>gwel jak ago being turned into

\v 3 gwel jak ago

(See how it’s on its own line?)
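If you would rather script this step than use an editor's Find and Replace, the same search and replacement can be sketched in Python. This is an illustration only: the function name spans_to_usfm and its css_class parameter are hypothetical, and the class name is assumed to be T7 as in this file.

```python
import re

def spans_to_usfm(html, css_class="T7"):
    """Turn <span class="T7">3</span> into a newline followed by '\\v 3 '."""
    pattern = re.compile(
        r'<span class="' + re.escape(css_class) + r'">(\d+)</span>'
    )
    # In the replacement: \n is a new line, \\ is one literal backslash,
    # \1 is the captured verse number, and the trailing space separates
    # the verse number from the verse text.
    return pattern.sub(r"\n\\v \1 ", html)

sample = '<span class="T7">3</span>gwel jak ago'
print(spans_to_usfm(sample))
```

Running it on the sample prints an empty line and then \v 3 gwel jak ago, which is the same result as the editor replacement above.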

If you want to do all of this editing in LibreOffice, you may need to change the file extension of the HTML to .txt to see the HTML codes.

With the sample files, we have a couple of other things to look at. First, the “front matter” is missing, so no one will know what book this is. Full documentation about USFM can be found here: http://ubsicap.github.io/usfm/identification/index.html, or you can look at a project for another language that has been saved from translationStudio or Autographa.

For Jude, the first lines should be:

\id jud Regular
\ide usfm
\h Jude
\toc1 Jude
\toc2 Jude
\toc3 jud
\mt Jude

Let’s look at this line-by-line.

\id jud Regular tells programs that this is the book of Jude, and that it’s an OL translation. It could also say \id jud ULB, or \id jud Tarangan.
\ide usfm tells programs that this is usfm, so they can decode it properly.
\h Jude is running header information. In this case, I would actually recommend using \h Yudas.
\toc1 is for the long form of the book name. In English, for example, we might put \toc1 The Epistle of James or \toc1 The Letter from James.
\toc2 is for the shorter form of the book name. \toc2 Yudas would be fine.
\toc3 is for an abbreviated name of the book. This is useful if you use a short form (Jhn 3:16) notation.
\mt is the title of the book as it’s printed at the top of the first page of the book. If you want to use multiple lines, you can use \mt1, \mt2, etc. In this case it should probably be \mt Salam Aban yudas.

Important note here: don’t change the book abbreviation in the first line: \id jud. This is the identifier for programs, and is based on English. All of the other places the book name appears, you can feel free to change it to the local name.
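When several books need the same treatment, the front matter can be generated rather than typed. The following is a minimal Python sketch, not an official tool. The function front_matter and its parameter names are made up for illustration, and using Salam Aban yudas for \toc1 is my assumption, since the article only recommends it for \mt.

```python
def front_matter(book_code, id_note, header, long_name, short_name,
                 abbreviation, main_title):
    """Build the identification lines that open a USFM file."""
    lines = [
        f"\\id {book_code} {id_note}",  # keep the English-based book code
        "\\ide usfm",
        f"\\h {header}",                # running header
        f"\\toc1 {long_name}",          # long form of the book name
        f"\\toc2 {short_name}",         # shorter form
        f"\\toc3 {abbreviation}",       # abbreviated form
        f"\\mt {main_title}",           # title printed on the first page
    ]
    return "\n".join(lines) + "\n"

# Values taken from the discussion above; the \toc1 value is an assumption.
print(front_matter(
    book_code="jud",
    id_note="Regular",
    header="Yudas",
    long_name="Salam Aban yudas",
    short_name="Yudas",
    abbreviation="jud",
    main_title="Salam Aban yudas",
))
```

Only book_code has to stay as the English-based identifier; every other value can use the local name.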

The file name should be made up of the language code, book code, resource type, and project type, separated by underscores. In this case, that is tre_jud_text_reg.usfm. (Please understand that I don’t know which Tarangan language this book is in, so I chose one of the language codes. Use the correct language code.)
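The naming pattern is simple enough to express as a tiny helper. This is a sketch only: the function usfm_filename is hypothetical, and the defaults text and reg are taken from the single example file name above, not from a documented list of allowed values.

```python
def usfm_filename(language_code, book_code,
                  resource_type="text", project_type="reg"):
    """Join the four parts with underscores and add the .usfm extension."""
    parts = [language_code, book_code, resource_type, project_type]
    return "_".join(parts) + ".usfm"

print(usfm_filename("tre", "jud"))  # tre_jud_text_reg.usfm
```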

When there are additional lines in the translation for section headings, USFM deals with these in a special way. (These are not part of the translation, as they are not from the original Bible texts — they are just to help people understand what they are about to read.) In this file, we have Allah On Aukum Dir-Dir Ago Daisago Sala. This should be on a line by itself with a \s tag and a space to show that it is a section name:

\s Allah On Aukum Dir-Dir Ago Daisago Sala

Finally, this file contains two books of the Bible: James and Philemon. These need to be split into separate files, one for each book. Follow the same directions for Philemon that we have followed for James.

Make sure you check for verses that weren’t formatted correctly: on the first run through we were missing verses 1, 2, 5, 13, and 16. Verse 16 was missing altogether.
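Checking for gaps by eye is error-prone, so a short script can list the verse numbers that never received a \v marker. This is a hedged sketch in Python: the function missing_verses is made up for illustration, it assumes a single-chapter book, and you have to supply the last verse number yourself (Jude has 25 verses).

```python
import re

def missing_verses(usfm, last_verse):
    """Return verse numbers from 1 to last_verse that have no \\v marker."""
    # Collect every number that follows a literal "\v " in the text
    found = {int(n) for n in re.findall(r"\\v (\d+)", usfm)}
    return [n for n in range(1, last_verse + 1) if n not in found]

# Placeholder verse text; only the markers matter for this check
sample = "\\c 1\n\\v 1 text\n\\v 2 text\n\\v 3 text\n\\v 4 text\n\\v 6 text"
print(missing_verses(sample, 6))  # [5]
```

A number in the result means the marker is missing, not necessarily the text: the verse may be present but still carry a plain, non-superscripted number, so go back to the source document for each one.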

 
