Re: Re: Issues when importing Word document with endnotes

As promised,

The longer, somewhat technical* story

* This is still something of an understatement.

The underlying problem is that Microsoft Word is written by Americans, for Microsoft Windows. The basic encoding of text in Word is Windows Latin1, that is, ye olde "A to Z" plus a somewhat random smattering of accented letters, as well as some useful typographic characters. This basic encoding was designed by Microsoft and is perfect for English texts (well, American English), but not that good for writing French, German, or Spanish, and it simply ignores Czech, Hungarian, or Polish special characters ... not to mention entirely different alphabets (think of Telugu, Tibetan, and Thai -- three random ones that start with a "T", and there are lots more).

So, wiser heads invented Unicode, a much larger system with room for more than a million different characters, and one that continues to grow even as we speak. (Although, famously, they refuse to include the Klingon alphabet.)

Now instead of re-writing their word processor from the ground up, Microsoft's programmers took a shortcut in implementing Unicode. Word files henceforward consisted of two types of blocks of characters: "regular" (i.e., Good Ole American text) and "Unicode" (a.k.a. The Rest of the World). Each single character in a "regular" block uses only one byte; each single character in a Unicode block uses at least two bytes (and possibly more). The file stores a list of these text blocks, recording for each one its offset inside the file, its length, and of course what type of block it is.
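To make that block list a bit more concrete, here is a minimal Python sketch. It is not Word's real on-disk structure: the `Piece` tuple, the `read_pieces` helper, and the sample bytes are all made up for illustration, and it assumes that "regular" blocks are Windows-1252 and that Unicode blocks are UTF-16 little-endian with two bytes per character.

```python
from typing import List, NamedTuple

class Piece(NamedTuple):
    """Hypothetical record for one block of text (not Word's real layout)."""
    offset: int       # byte offset of the block inside the raw data
    n_chars: int      # length in characters, not in bytes
    is_unicode: bool  # True: two bytes per character, False: one byte

def read_pieces(data: bytes, pieces: List[Piece]) -> str:
    """Decode every block with the encoding that matches its type."""
    out = []
    for p in pieces:
        width = 2 if p.is_unicode else 1
        raw = data[p.offset : p.offset + p.n_chars * width]
        out.append(raw.decode("utf-16-le" if p.is_unicode else "cp1252"))
    return "".join(out)

# Made-up sample: a one-byte block, a two-byte block, another one-byte block.
data = "Caf\u00e9 ".encode("cp1252") + "Dvo\u0159\u00e1k".encode("utf-16-le") + b" rocks"
pieces = [Piece(0, 5, False), Piece(5, 6, True), Piece(17, 6, False)]
text = read_pieces(data, pieces)
```

Note that the Czech name cannot be stored in the one-byte block at all, which is exactly why the second block type exists.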

That means that even for a simple instruction such as "please import the plain text", there is a whole lot of to-and-fro calculation going on in the background. You cannot simply ask how long a block of text is: do you get the length in characters or in bytes? The numbers may differ. Adding to the fray, there doesn't seem to be much consensus inside the Word format about which data refers to bytes and which to characters. (See Understanding the Word .doc Binary File Format on Microsoft's own site if you want to know more about that.) In addition, Microsoft churned out lots of different versions of the .doc file format, and they just fibbed (pun intended) the file header to cater for new additions. Now the documentation is littered with "Obsolete" and "Do Not Use" remarks, and many values are unreliable, obsolete, "may not be up to date", and "may be expanded in future versions".
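The characters-versus-bytes bookkeeping can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each block is a hypothetical `(byte_offset, n_chars, is_unicode)` tuple, one-byte blocks use one byte per character, and Unicode blocks use exactly two.

```python
def char_to_byte_offset(pieces, char_pos):
    """Map a character position in the text to a byte offset in the file."""
    start = 0  # character position at which the current block begins
    for offset, n_chars, is_unicode in pieces:
        if char_pos < start + n_chars:
            width = 2 if is_unicode else 1
            return offset + (char_pos - start) * width
        start += n_chars
    raise IndexError("character position lies outside all blocks")

# Made-up layout: 5 one-byte chars, then 6 two-byte chars, then 6 one-byte chars.
pieces = [(0, 5, False), (5, 6, True), (17, 6, False)]
sixth_char = char_to_byte_offset(pieces, 6)      # inside the two-byte block
twelfth_char = char_to_byte_offset(pieces, 11)   # first char of the last block
```

Seventeen characters occupy twenty-three bytes here, so any reader that confuses the two kinds of length even once will land in the wrong place for everything that follows.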

How does this tie in to your Endnote Problem? Real "endnotes" have a real simple structure in the file. In place of the actual footnote/endnote marker in your document, there is a one-byte code that basically states "here comes the next note". There is a bit more information stored elsewhere: which note number it is (you can (gasp) select whatever number you want!), how it's supposed to be formatted (again a plus point for Word), and where the actual note text itself is stored. That's for regular endnotes; but yours aren't.

Another Word feature, employed by the Word extension you are using (confusingly also called "Endnote"), is that it can store any data you want inside a Field code. Field codes are something like the Text Variables of InDesign, but they can do far more. Automatic references are field codes; so are equations, page numbers, and hyperlinks.

A Field code typically contains displayable text (such as a superscript number) as well as hidden data (which can temporarily be made visible inside Word only). The Hidden text is what an extension such as Endnote uses to automatically construct a References section at the end of your document. It's the combination of automatic numbering and appearing at the end that makes it look and work just like regular endnotes inside Word.

So why does this hidden text suddenly appear? The programmers of InDesign made mistakes in the Word Import filter.

The blocks of "one-byte/two-byte" texts are totally independent of what sort of text or code is inside these blocks, and of what those characters actually mean. This means the entire hidden 'endnote' data may not be inside a single block.

In practice, this means that one has to very carefully track which text should appear where, and how it should be read and translated to native InDesign text. Somewhere inside the Word-reading code, they forgot to translate a jump from one to two bytes (or the other way around) in the middle of a field code's text. So suddenly, a block that was rightfully hidden in Word got counted as, and incorporated into, the main run of text; and that is what you see in your post. After that sudden intrusion, the "regular" text continues as usual -- in your case, directly after the closing code "</Endnote>". You can see that the sentence runs on normally when ignoring the <...> trash data:

One study⁶[trash] investigated the association ...

The superscript "6" is the 'visible' part of the "Endnotes" field. The start of the Field code that normally hides the 'hidden' part is read correctly, and so its text is not imported; but the end of that field code is mis-calculated and so InDesign happily jumps right into Things That Should Not Be Seen.
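Here is a small Python sketch of why one mis-read marker is enough to spill hidden data into the text. It is not a reconstruction of InDesign's actual bug. It assumes the control characters 0x13, 0x14, and 0x15 as the field begin, separator, and end marks, with the hidden part between begin and separator and the displayed part between separator and end; the sample string and the `visible_text` helper are made up.

```python
FIELD_BEGIN, FIELD_SEP, FIELD_END = "\x13", "\x14", "\x15"  # assumed marker values

def visible_text(chars: str) -> str:
    """Keep normal text and field results; drop the hidden field data."""
    out = []
    stack = []  # one flag per open field: True while inside its hidden part
    for ch in chars:
        if ch == FIELD_BEGIN:
            stack.append(True)
        elif ch == FIELD_SEP:
            if stack:
                stack[-1] = False   # the displayed result starts here
        elif ch == FIELD_END:
            if stack:
                stack.pop()
        elif not any(stack):        # not inside any hidden part
            out.append(ch)
    return "".join(out)

# Made-up sample: hidden data first, then the visible superscript number.
sample = ("One study" + FIELD_BEGIN + "<Endnote>hidden data</Endnote>"
          + FIELD_SEP + "6" + FIELD_END + " investigated")
clean = visible_text(sample)

# Simulate a reader that loses a single marker, e.g. after a bad offset.
leaky = visible_text(sample.replace(FIELD_BEGIN, "", 1))
```

With all markers intact, `clean` is the sentence with just the "6" in it; with one marker lost, `leaky` contains the whole `<Endnote>...</Endnote>` blob in the middle of the running text.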

Other issues caused by this mis-reading can be encountered as well:

InDesign forgets to insert the proper sequence number for a note, and you get a pink "unknown" character instead (the "unknown" character is actually a generic placeholder, and should have been replaced with the correct number).
InDesign loses track of where text fragments should go and so they end up at the end of a document: you see several hard returns at the end of your text, sometimes with one or two characters still attached to them. If you find where they came from by comparing the imported text with the original Word file, you will see that the characters including the hard returns are missing in InDesign.
InDesign may skip a single Bold On or Bold Off code, and after that the Bold attribute is inverted for a while, typically up to the end of the paragraph (Word requires an explicit reset of all text attributes at the start of the next paragraph).
InDesign fails to read a certain code (a first asterisked note before the one numbered "1", for example), and everything after that gets shifted by the number of bytes that note ought to have occupied. If ID is able to import the file, one of the most surprising results can be that this note gets placed as a footnote inside the very first footnote.
... and of course, InDesign may not import a perfectly formed Word file at all. I hate it when that happens.
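The inverted Bold problem from the list above is easy to imitate. The following Python sketch is purely a toy model: the `TOGGLE` character is a hypothetical stand-in for a "Bold On/Off" code (it is not how Word really stores formatting), and a newline stands for the paragraph end at which attributes are reset.

```python
TOGGLE = "\x01"  # hypothetical stand-in for a "Bold On/Off" code

def bold_text(chars: str) -> str:
    """Return only the characters that end up bold."""
    bold = False
    out = []
    for ch in chars:
        if ch == TOGGLE:
            bold = not bold
        elif ch == "\n":
            bold = False  # attributes are reset at the start of a paragraph
        elif bold:
            out.append(ch)
    return "".join(out)

good = "a " + TOGGLE + "bold" + TOGGLE + " word here\nnext paragraph"
bad = good.replace(TOGGLE, "", 1)  # the reader skipped the first toggle

as_intended = bold_text(good)
after_skip = bold_text(bad)
```

In the intact text only "bold" is bold; after skipping one toggle, "bold" is plain and the rest of that paragraph is bold instead, while the next paragraph is fine again.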

How can I be so sure it's a problem in InDesign and not in the Word file? Well ... as I have often advised in the past for similar problems, re-saving a document inside Word may solve it. That is because on a re-save, the text blocks are cleaned up -- just like a Save As in InDesign can solve lingering random problems. Saving as another file type may also work, because even though the file import filter uses something of a shared code base (there is a lot of similar functionality between reading DOC, RTF, and DOCX), the dirty low-level routines that actually have to deal with the raw bytes are, by definition, coded specifically for each file type. Which in practice means that an error of this kind in one filter may not be there in another.

With all this knowledge, and a browser bookmark on Microsoft's documentation, I wrote a JavaScript to read Word files with. After some initial problems, I got it to work for plain text (so no notes or tables or auto-numbering), and much to my surprise I ran into similar problems: a relatively small error in my reading code produced the same kind of errors in my own imported text.

I found out I could fix this in my script, and would have gladly expanded it to read and format an entire Word file if only JavaScript was just a teensy bit faster... Reading a simple file the hard (but correct) way costs about 15 minutes, even on a fast system. So I abandoned this idea and now always clean up my Word files in Word -- 95 out of 100 times this works straight up, and for the remainder, a single glance at the source file is usually enough to spot the problem, which I then solve "manually" (i.e., moving it to the end of the file, or just deleting it).
