| Non-ASCII characters in stories [message #23281] |
Fri, 23 January 2009 20:54  |
CNash Messages: 73 Registered: September 2008 |
I've noticed that while most of the stories are encoded in regular ASCII/ANSI Windows format, there are some that are UTF-8 (Unicode) or otherwise use non-ASCII characters like the "curly quotes" that don't render correctly in editors and readers that aren't set up to handle them (showing up either as garbage characters or question marks). As I'm using a text reader program that doesn't render non-ASCII characters, I have to run each story through a converter program before I can read it properly.
I really don't mean to complain when all of this is presented for free (and I certainly wouldn't expect anyone to go back through stories and change them to suit my whims!), but if it's not too much trouble, could future stories be uploaded with only ASCII-standard characters?
There are worlds where the sky is burning, and the sea's asleep, and the rivers dream. People made of smoke, and cities made of song. Somewhere there's danger, somewhere there's injustice, and somewhere else the tea's getting cold...
| Re: Non-ASCII characters in stories [message #23440 is a reply to message #23281 ] |
Sun, 25 January 2009 17:19   |
CNash Messages: 73 Registered: September 2008 |
I hadn't considered formatting like italics or bold because... well, I didn't actually know that the stories had it, as I read them all in plain text format and thus never see it! 
But I do see your point, and as I said, I wouldn't want anyone to do unreasonable amounts of work for the benefit of a minority (which might have only one member!). I'll continue as I've been doing.
One thing, though - double-spaces between words seem to occasionally show up as a special character and produce question marks. According to NoteTab, which I'm using to batch-convert the stories from HTML to text and replace curly quotes etc., the character converts to ASCII and produces this:
á
I can filter this, but I'm not sure why it's there in the first place. Really, I'm a bit of a newbie when it comes to character sets...
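For anyone doing the same kind of batch clean-up, here is a minimal Python sketch of the filtering step described above. It is not what NoteTab does internally; the `ASCII_REPLACEMENTS` table and the `to_ascii` name are made up for illustration, and the list of characters is only a guess at what typically shows up in Word-generated text.

```python
# Hypothetical replacement table: extend it with whatever your stories contain.
ASCII_REPLACEMENTS = {
    "\u2018": "'",    # left single curly quote
    "\u2019": "'",    # right single curly quote / apostrophe
    "\u201c": '"',    # left double curly quote
    "\u201d": '"',    # right double curly quote
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2026": "...",  # ellipsis
    "\u00a0": " ",    # non-breaking space
}


def to_ascii(text: str) -> str:
    """Swap common typographic characters for ASCII look-alikes.

    Anything non-ASCII that is not in the table becomes '?'.
    """
    cleaned = text.translate(str.maketrans(ASCII_REPLACEMENTS))
    return cleaned.encode("ascii", errors="replace").decode("ascii")


print(to_ascii("\u201cIt\u2019s fine,\u201d she said\u00a0\u2026"))
```

The text has to be decoded with the right encoding before this runs; if the file is read with the wrong one, the characters in the table will never match.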
| Re: Non-ASCII characters in stories [message #23443 is a reply to message #23440 ] |
Sun, 25 January 2009 17:49   |
storyreader2005 Messages: 88 Registered: July 2005 Location: Ohio |
| Quote: | One thing, though - double-spaces between words seem to occasionally show up as a special character and produce question marks.
|
One of the characters in that double space is a "non-breaking space" (&nbsp; in HTML code).
That is because two consecutive regular spaces in an HTML page are collapsed: browsers display only a single space, so a non-breaking space is used to keep the second one visible.
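A small Python sketch of why that character can turn into "á". This assumes the converter interprets the byte using the old DOS code page 437; that is a guess at what is going on, not something confirmed about NoteTab.

```python
import html

# &nbsp; in HTML is the Unicode character U+00A0 (non-breaking space).
nbsp = html.unescape("&nbsp;")
print(repr(nbsp))

# In Windows-1252 (and Latin-1) it is stored as the single byte 0xA0.
raw = nbsp.encode("cp1252")
print(raw)

# Assumption: the byte is then read as DOS code page 437,
# where 0xA0 happens to be the letter a-acute.
as_dos = raw.decode("cp437")
print(as_dos)
```

So the character is not garbage in the file itself; it is a valid space that gets a different meaning when the byte is read under another code page.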
| Re: Non-ASCII characters in stories [message #23565 is a reply to message #23281 ] |
Mon, 26 January 2009 13:21   |
Rabiata Messages: 521 Registered: July 2008 Location: Germany |
About formatting:
HTML formatting is not the problem here, as the HTML tags like <i>sometext</i> (italics in this case) consist of ASCII characters that match in all the common character encodings. I have yet to see a browser that fails to understand these.
The real problem is that ASCII only defines the character codes 0-127, and character encodings frequently differ for the codes 128-255. That leads to a mess that has not been fully cleaned up yet through standardization.
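To make that concrete, here is a short Python sketch that decodes the same byte value under a few encodings. The `decode_in` helper is made up for this example, and the three encodings are just a sample; others differ in the upper range too.

```python
def decode_in(byte_value, encodings=("cp1252", "latin-1", "cp437")):
    """Return {encoding: character} for one byte value (None if undefined)."""
    result = {}
    for enc in encodings:
        try:
            result[enc] = bytes([byte_value]).decode(enc)
        except UnicodeDecodeError:
            result[enc] = None
    return result


# 0x41 is in the ASCII range: every encoding agrees it is "A".
print(decode_in(0x41))

# 0x93 is above 127: a curly quote in Windows-1252, a control
# character in Latin-1, and an accented "o" in code page 437.
print(decode_in(0x93))
```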
| Re: Non-ASCII characters in stories [message #23570 is a reply to message #23567 ] |
Mon, 26 January 2009 13:49   |
XaltatunOfAcheron Messages: 1930 Registered: July 2005 Location: Atlantis |
| oljak.eru wrote on Mon, 26 January 2009 11:33 |
| Rabiata wrote on Mon, 26 January 2009 19:21 | The real problem is that ASCII only defines the character codes 0-127, and character encodings frequently differ for the codes 128-255. That leads to a mess that has not been fully cleaned up yet through standardization.
| We have a solution that works for nearly any situation - Unicode. Just encode the content of the HTML in UTF-8, then attach an HTTP Content-Type header that tells the browser it's encoded in UTF-8.
The problem here is that these documents are encoded using Windows-1252 but the server is attaching an HTTP Content-Type header that tells the browser the document is UTF-8. But every character in the document is available in the UTF-8 encoding, so the easiest way to fix the problem is to encode the document in UTF-8 instead. There's probably a setting in whatever program the canon authors use to generate UTF-8 instead of Windows-1252, which should solve the problem. Another way to solve the problem is to script an encoding conversion that happens automatically when stories are uploaded.
|
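For reference, the mismatch and the conversion described in the quote can be sketched in a few lines of Python. This is only an illustration, not the script the site uses; `reencode_cp1252_to_utf8` is a made-up name, and it assumes the input really is Windows-1252.

```python
def reencode_cp1252_to_utf8(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8.

    Raises UnicodeDecodeError on the few byte values that Python's
    cp1252 codec leaves undefined (for example 0x81).
    """
    return data.decode("cp1252").encode("utf-8")


# A curly-quoted word as Word would typically save it: b'\x93Hello\x94'
original = "\u201cHello\u201d".encode("cp1252")

# What a reader sees when those bytes are labelled UTF-8:
# 0x93 and 0x94 are not valid on their own, so they get replaced.
mislabelled = original.decode("utf-8", errors="replace")
print(mislabelled)

# After conversion the bytes match the UTF-8 label.
fixed = reencode_cp1252_to_utf8(original)
print(fixed.decode("utf-8"))
```

Either side of the mismatch can be changed: convert the files so they match the header, or change the header so it matches the files.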
Actually, Warren has been handling the issue quite well - it's just that he forgot to run the story through HTMLTidy the last couple of times. This happens, especially when you've got an elderly relative who's on death's doorstep.
Bob Arnold runs the server (not Warren), and he's the guy who needs to look into changing the character set the server declares to Windows-1252. Since (almost) all the authors use Word or an equivalent to create their stories, this would simplify things all around. However, it's his call, and he may very well have considerations I don't know about.
Xaltatun
Oxymoron: Jumbo Shrimp
Impossible: Sustainable Growth