Various scripts manipulating Tibetan Texts, including scripts to:
- Convert Word Files in Sambhota fonts to Unicode (Windows and Mac)
- Insert milestones in Sambhota converted files based on OCR
- Add THL Styles to a Word Doc with options to apply styles to annotations and/or milestons
- Convert between txt, rtf, and docx files (on Mac)
- Split a volume into texts
This was developed originally as Sambhota Conversion Scripts for the Peltsek Kama files that were in Sambhota font, but in the process of developing another set of scripts InsertMilestones in the repo by the same name at https://github.com/thl-texts/insert-milestones.git, it was decided to merge those scripts here and create a general Tibetan Text Scripts repo.
- Developer: Than Grove
- Creation Date: July 31, 2019
As detailed in the next section, the Sambhota conversion method uses both a Mac OS 10 and Windows 10 environment. The main conversion script is a Python script run on the Mac. It was written for Python 3 and only uses two additional packages:
- lxml
- python-doc
A requirements.txt
file has been included so one can set up a virtual environment. This is done through the following
steps:
- Install Python on your Mac using Homebrew with
brew install python3
(Note: if you don't have Homebrew installed, look to its installation page). This will also installpip
- In a terminal window navigate into this repos directory on your Mac machine
- Type
python -m virtualenv venv
- Type
source venv/bin/activate
- Type
pip install -r requirements.txt
Note: To get out of the virtual environment, either close the terminal or type deactivate
This is the process to take chunked texts of a Kama volume and insert the milestones.
- Copy volume chunked files into the workspace\in folder
- Run
insert_milestones.py -v 96 -s 8
where-v
is the volume number and-s
is the page that the volume starts on, skipping introductory TOC etc., i.e., the first page of first text. - Run
add_styles2docx.py -i workspace/out -a -m
Other options include:
-b
(blank pages skipped in scan so scan image number doesn't count them),-br
(pages which represent a new file to force a break when script not working properly),-c
Clear files from in and out folders and copy in new raw chunk files for the volume in question,-d
turn on debugging.
The conversion process is a multi-stepped process that requires both a Mac and a Windows machine to proceed most efficiently. I did it by using Virtual Box to create a virtual Windows 10 machine on my Mac and shared the files between the virtual Windows machine and my Mac through UVA Box, but other ways to transfer files are also possible. When you just want to convert the individual files into templated Word docs with the same name, then the steps in the process are:
- Download the original Word .doc files to the
in
folder here on Mac - Run
doc2rtf.sh in out
- this converts the .doc files to .rtf and puts in theout
folder - Transfer to Windows machine (Upload to UVA Box and then download to Windows Virtual machine), putting them
in the
in
folder - Run the batch script
convertSamRTF.bat
. This will use UDP to convert the Sambhota files to Unicode and will put the converted Unicode files in theout
folder- Make sure that in UDP Advance settings you click the box "Put smaller font between « and »"
- Transfer the unicode RTF files to the
in
folder on the Mac machine - Run the
rtf2docx.sh in out
script, which converts the files to.docx
format and puts them in theout
folder. Put the.docx
files in thein
folder. - Run the python script
add_styles2docx.py
. This will convert all documents in thein
folder and put them in theout
folder
The converted .docx
files will then be in the out
folder
The Python script will convert each document into Unicode .docx file with all paragraphs as Paragraph style but with the text between the « and » marked up as Annotations style. A metadata table will be placed at the top of the document.
The document template used is in the resources
folder. It is tibtext-styled-tpl.docx
.
The script add_styles2docx.py will take all the .docx files in the /in folder and add the THL styles to them, producing new documents in the out folder. When this script is run, it gives you three options. You can choose 1. whether or not to add the metadata table and basic headers to the resulting Word doc (this is primarily for individual texts not volumes), 2. whether or not to convert text surrounded by << and >> into annotations by removing those markers and applying the Word style, and 3. whether or not to apply the styles to the page and line milestones in the doc with the format "[23.4]".
Sometimes the Sambhota texts received are volumes or volume "chunks" that need to be broken up into individual texts.
For the Kama I wrote a script that did this, which may work for other collections. This script is split_into_texts.py
All documents to be processed should be put in the in
folder and the results will go in the out
folder. There are
two global variables of note. filefilterstr
is the file name prefix for the files to process in the IN folder,
assuming they all begin with the same string such as "KAMA". text_break_string
is the string used to split the chunks
into texts. This is the universal string that separates two texts in a document. For the KAMA it was '༄༅'. Further
modifications will need to be done to break on the 'བཞུགས་སོ།' since the line just before will also need to be included in
the subsequent text. This has not yet been implemented. (Update: This method is problematic creating too many false breaks
and/or missing real breaks. Better to change or add a script to break up based on cataloging data once the milestones
have been inserted.)
NOTE: It is important that the styles be up-to-date. Download the latest Word doc template with THL Styles
from this Google Docs link (It is in the folder
Toolbox for Tibetan Library > Tibetan Text Work >
Inputting & Editing Tibetan Texts). Then open a new
document with that template, paste in the metadata table from the old one, and replace the old one at resources\tibtext-styled-tpl.docx
Comments:
- Some scripts such as
rtf2txt.sh
are residual from previous attempts. They are not used in the conversion but I have left them in the repo in case they are useful in the future.