How to post a Chinese/Japanese/Korean (CJK) article to arXiv

arXiv has a noble policy of accepting non-English articles, as explained in this FAQ entry. But the TeX compiler at the arXiv is the standard one, not modified for East Asian languages. One option is to post just the compiled PDF, but this is discouraged, as explained in this FAQ entry. In the following I describe how you can post a CJK article to the arXiv. The key point is that the arXiv has a full install of TeX Live, so we can use any macros it contains.

I strongly recommend that you get the standard full install of TeX Live, if you haven't done so already. Yes, it takes a lot of disk space. But your aim here is to prepare a tricky TeX file which can be processed by the arXiv auto-compiler. Being able to compile your file on your local specialized TeX installation doesn't say much about whether the arXiv can compile it. So, just get the full TeX Live, and test your file against that.

I was helped a lot by the good folks at TeX.StackExchange. See this question and this question, for example.

Minimal steps

  1. Prepare your TeX file as you wish, in pTeX or XeTeX or whatever.
  2. Include the CJK package and use the CJK environment as follows:
    \documentclass{article}
    \usepackage{CJK}
    \begin{document}
    \begin{CJK}{UTF8}{min}
    ... main material ... 
    \end{CJK}
    \end{document}
    You should now save the file in UTF-8. Don't even think of using EUC or SJIS, which will bring you a lot of pain. {min} stands for the Mincho font for Japanese. For Chinese, you should replace it with {song}. Check that your file compiles with latex or pdflatex.
  3. You need to include in your submission a file called 00README.XXX, with a line saying nohypertex. This stops the arXiv from automatically adding its hyperref facility. Otherwise, the file fails to compile. For more on 00README.XXX, read this FAQ entry.
  4. Here's sample1.tex and the generated PDF so far. That's it!
  5. Note that in order to change "Figure 1" to 図1 in the figure captions, you need to say \def\figurename{図}. This can't appear in the preamble, because CJK characters are only allowed within \begin{CJK}...\end{CJK}.
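
As a quick sanity check for step 2, here is a small Python sketch that tells you whether the bytes of a file decode as UTF-8. It is not part of the recipe above, and the function name utf8_error is made up for illustration.

```python
def utf8_error(data: bytes):
    """Return None if data is valid UTF-8, else a short description of the first problem."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start is the offset of the first byte that could not be decoded.
        return f"invalid UTF-8 at byte offset {err.start}: {err.reason}"
    return None
```

For example, utf8_error(open("FOO.tex", "rb").read()) should return None for a file saved in UTF-8, and an error message for a file saved in SJIS or EUC that contains Japanese text.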

Adding hyperref

  1. To enable hyperlinking inside the TeX file, add the line
    \usepackage[CJKbookmarks]{hyperref}
    Don't forget to add the option CJKbookmarks. Also, note that you should not remove the line nohypertex from 00README.XXX. You still need to stop arXiv from attempting to insert hypertex automatically.
  2. Now, if you look at the resulting PDF carefully, you will notice that the bookmarks in the PDF are corrupt. This is because the intermediate file containing the bookmark data, called FOO.out, is in UTF-8. (Here I'm assuming your TeX file is named FOO.tex.) You need to convert it to UTF-16BE and write it out as octal escape sequences.
  3. Download this perl script and run
    perl tweakbookmark.perl <FOO.out >tmp.out
    mv tmp.out FOO.out
    You then need to add a line to your TeX file saying
    \let\WriteBookmarks\relax
    to stop LaTeX from overwriting the corrected FOO.out file with the uncorrected one. Compile the file again. Here's sample2.tex and the generated PDF so far.
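
To make the "UTF-16BE in octal sequences" step less mysterious, here is a minimal Python sketch of the encoding itself. It is not a replacement for tweakbookmark.perl and does not parse FOO.out. It only converts one bookmark title, and it assumes the usual PDF convention that a Unicode text string starts with the byte order mark FE FF. The function name pdf_octal_utf16be is hypothetical.

```python
def pdf_octal_utf16be(title: str) -> str:
    """Encode a bookmark title as UTF-16BE, written as backslash-octal escapes."""
    # Assumption: a Unicode PDF text string begins with the byte order mark FE FF.
    data = b"\xfe\xff" + title.encode("utf-16-be")
    # Every byte becomes a backslash followed by exactly three octal digits.
    return "".join(f"\\{byte:03o}" for byte in data)
```

With this sketch, the title 図 (U+56F3) becomes \376\377\126\363: the first two escapes are the byte order mark, and the last two are the bytes 56 and F3.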

Making it look better

  1. If you're satisfied with the quality of the Japanese fonts used, that's it. If you're not, you can do the following. First, go and get the IPA fonts. You can use any other redistributable TrueType fonts instead. Let's say you decide to use the font ipamp.ttf. If you use a font named FONT.ttf, just replace all appearances of ipamp with FONT in the following.
  2. In the following we use pdfTeX. In principle it can be done with dvips, but that requires a lot of additional steps, and I haven't figured out exactly how. You might need to convert your EPS figures to PDF, so that pdfTeX can include them.
  3. Now, standard TeX assumes a font contains only 256 letters. But Unicode fonts contain up to 256^2 = 65536 letters. Therefore, CJK.sty uses a trick which decomposes a CJK font into 256 subfonts of 256 letters each. You need to generate the 256 TeX font metric files as follows:
    ttf2tfm ipamp.ttf -q ipamp@Unicode@
    Then you add the following lines to your TeX file:
    \makeatletter
    \AtBeginDvi{\pdfmapline{=ipamp@Unicode@  <ipamp.ttf}}
    \DeclareFontFamily{C70}{ipamp}{\hyphenchar \font\m@ne}
    \DeclareFontShape{C70}{ipamp}{l}{n}{ <-> CJK * ipamp}{}
    \DeclareFontShape{C70}{ipamp}{m}{n}{ <-> CJK * ipamp}{\CJKnormal}
    \DeclareFontShape{C70}{ipamp}{bx}{n}{ <-> CJKb * ipamp}{\CJKbold}
    \makeatother
    
    and change \begin{CJK}{UTF8}{min} to \begin{CJK}{UTF8}{ipamp}. Here's sample3.tex and the generated PDF so far.
  4. In the submission, you now need to include the font file ipamp.ttf, the generated TFM files, and the font's licence file. You need to add a line to 00README.XXX saying
    Licence.txt ignore
    so that the licence file doesn't appear in the final download in the arXiv entry. Here is a sample 00README.XXX file.
  5. If your .ttf file is small, this is the end of the story. However, if you really use ipamp.ttf, it's around 8 Mbytes, which is above the upper bound for an individual file at the arXiv as of 2011. So you need to remove the unnecessary characters from the font file.
  6. For that, you need to install fontforge. We're going to enumerate all characters used in your TeX file, generate a fontforge script, and create a new font.
  7. Install fontforge. That might take a while.
  8. Get this perl script, and run
    perl generatefontforgescript.perl FOO.tex FOO.bib > select.pe
    fontforge -script select.pe ipamp.ttf ipamp-select.ttf
    If you're not using bibtex, you can leave out FOO.bib above. This creates a smallish ttf file, named ipamp-select.ttf.
  9. You modify the TeX file so that it says
    \AtBeginDvi{\pdfmapline{=ipamp@Unicode@ <ipamp-select.ttf}}
    so that it uses the smaller font you just prepared.
  10. Here is the final sample TeX file and the generated PDF. That's it!
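
For the curious, the idea behind steps 6 to 8 can be sketched in Python as follows. This is not the perl script linked above, and I have not checked it against that script. It simply collects the distinct non-ASCII characters in a text and writes a fontforge script that keeps only those glyphs. The function names are hypothetical. The script uses fontforge's native scripting commands Open, SelectNone, SelectMore, SelectInvert, Clear and Generate, and it assumes every selected character actually exists in the font.

```python
def used_non_ascii_chars(text: str) -> list:
    """Return the distinct non-ASCII characters in text, sorted by code point."""
    return sorted({ch for ch in text if ord(ch) > 0x7F})


def fontforge_script(chars) -> str:
    """Build a fontforge script that keeps only the glyphs for chars.

    $1 and $2 are the input and output font files given on the command line.
    """
    lines = ["Open($1)", "SelectNone()"]
    # Select every character we want to keep, by Unicode code point.
    lines += [f"SelectMore(0u{ord(ch):04x})" for ch in chars]
    # Invert the selection and clear it, so that everything else is removed.
    lines += ["SelectInvert()", "Clear()", "Generate($2)"]
    return "\n".join(lines) + "\n"
```

Feeding the contents of FOO.tex (and FOO.bib, if any) to used_non_ascii_chars and saving the output of fontforge_script as select.pe would give a file that can be run in the same way as in step 8.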