What every developer should know about character encoding


I wrote an earlier post like this about timezones assuming 20 people would read it (and 3 would find it useful). And the first day I had over 9,000 page views. So here I am with the same attempt on character encoding. If you write code that touches a text file, you probably need this.

Let's start off with two key items:

Unicode does not solve this issue for us (yet).
Every text file is encoded. There is no such thing as an unencoded file or a "general" encoding.

And let's add a codicil to this: most Americans can get by without having to take this into account, most of the time. That is because the first 128 byte values (0 – 127) map to the same set of characters in the vast majority of encoding schemes. And because we only use A-Z without any other characters, accents, etc., we're good to go. But the second you use those same assumptions in an HTML or XML file that has characters outside that range, the trouble starts.

The computer industry started with disk space and memory at a premium. Anyone who suggested using 2 bytes for each character instead of one would have been laughed at. In fact we're lucky that the byte worked best as 8 bits or we might have had fewer than 256 values for each character. There were of course numerous character sets (or code pages) developed early on. But we ended up with most everyone using a standard set of code pages where the first 128 values (0 – 127) were identical on all and the second 128 (128 – 255) were unique to each set. There were sets for America/Western Europe, Central Europe, Russia, etc.
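A quick way to see this split for yourself is the sketch below. It assumes Python's built-in codecs, and the three code pages and the sample byte 0xE4 are simply my picks for illustration.

```python
# Three single-byte code pages: Western Europe, Greek, Cyrillic.
CODE_PAGES = ["cp1252", "iso8859_7", "cp1251"]

# Plain A-Z text encodes to identical bytes in every one of them.
ascii_bytes = {cp: "Hello".encode(cp) for cp in CODE_PAGES}

# A byte above 127 means a different character in each code page.
high_byte = bytes([0xE4])
decoded = {cp: high_byte.decode(cp) for cp in CODE_PAGES}

for cp in CODE_PAGES:
    print(cp, ascii_bytes[cp], decoded[cp])
```

The lower half never changes, so English text survives any of these code pages. The upper half is where the reader has to know which code page the writer used.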

And then for Asia, because 256 characters were not enough, some of the values in the range 128 – 255 were used for what was called DBCS (double byte character sets). Each such value is the first byte of a two-byte pair, and the second byte then identifies one of up to 256 characters. This gave a total of up to 128 * 256 = 32,768 additional characters. It was a hack, but it kept memory use to a minimum. Chinese, Japanese, and Korean each have their own DBCS code page.
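Here is a small sketch of the double-byte idea, assuming Python's built-in Shift JIS codec as a stand-in for one of the Japanese code pages. The sample string is my own.

```python
# Shift JIS is one of the Japanese double-byte code pages.
text = "A日本"
encoded = text.encode("shift_jis")

# "A" stays a single byte; each kanji becomes a lead byte plus a second byte.
per_char = [len(ch.encode("shift_jis")) for ch in text]

# The lead bytes of the two kanji both sit in the upper half (128 - 255).
lead_bytes = [ch.encode("shift_jis")[0] for ch in text[1:]]

print(encoded.hex(" "), per_char, lead_bytes)
```

Note that the string is 3 characters but 5 bytes, which is why code that counts bytes and code that counts characters start to disagree as soon as a DBCS code page is involved.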

And for a while this worked well. Operating systems, applications, etc. mostly were set to use a specified code page. But then the internet came along. A website in America using an XML file from Greece to display data to a user browsing in Russia, where each is entering data based on their country: that broke the paradigm.

Fast forward to today. The two file formats where we can explain this the best, and where everyone trips over it, are HTML and XML. Every HTML and XML file can optionally have the character encoding set in its header metadata. If it's not set, the XML specification says a file with no byte order mark is UTF-8, and many programs assume UTF-8 for HTML as well, but that assumption is not universally followed. If the encoding is not specified and the program reading the file guesses wrong, the file will be misread.
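To make "misread" concrete, here is a minimal sketch assuming Python's built-in codecs. The word and the two encodings are just an example I chose: the writer used UTF-8 and the reader guessed the Western European code page.

```python
original = "café"
file_bytes = original.encode("utf-8")   # what the writer put in the file

right = file_bytes.decode("utf-8")      # the reader guesses correctly
wrong = file_bytes.decode("cp1252")     # the reader assumes the local code page

print(right, wrong, len(right), len(wrong))
```

No error is raised in the wrong case. The reader just gets one extra, incorrect character, which is what makes this kind of bug easy to ship.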

Point 1 – Never treat specifying the encoding as optional when writing a file. Always write it to the file. Always. Even if you are willing to swear that the file will never have characters out of the range 1 – 127.

Now let's look at UTF-8 because, as the de facto standard and because of the way it works, it gets people into a lot of trouble. UTF-8 became popular for two reasons. First, it matched the standard code pages for the first 128 characters and so most existing HTML and XML would match it. Second, it was designed to use as few bytes as possible, which mattered a lot back when it was designed and many people were still using dial-up modems.

UTF-8 borrowed from the DBCS designs from the Asian code pages. The first 128 byte values are all single byte representations of characters. Then for the next most common set, it uses a block in the second 128 values as the first byte of a double byte sequence, giving us more characters. But wait, there's more. For the less common characters there's a first byte which leads to a series of second bytes. Those then each lead to a third byte and those three bytes define the character. The original design went up to 6 byte sequences; the current definition (RFC 3629) stops at 4 bytes, which is enough for every Unicode character. Using this MBCS (multi-byte character set) you can write the equivalent of every Unicode character. And assuming what you are writing is not a list of seldom used Chinese characters, you do it in fewer bytes than a fixed-width encoding would need.
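The variable length is easy to observe directly. This is a rough sketch assuming Python's built-in codecs; the four sample characters and the English sample string are my own picks, one for each sequence length in use today.

```python
# One sample character from each UTF-8 length class (illustrative picks).
samples = ["A", "ß", "€", "😀"]

utf8_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, utf8_lengths):
    print(f"U+{ord(ch):04X} takes {n} byte(s): {ch.encode('utf-8').hex(' ')}")

# Compare with a fixed two-bytes-per-unit encoding for plain English text.
english = "Hello, world"
utf8_size = len(english.encode("utf-8"))
utf16_size = len(english.encode("utf-16-le"))
print(utf8_size, utf16_size)
```

For mostly-English text the UTF-8 file is half the size of the 16-bit one, which is the saving the design was after.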

But here is what everyone trips over: they have an HTML or XML file, it works fine, and they open it up in a text editor. Then, in that text editor, which is using the code page for their region, they insert a character like ß and save the file. Of course it must be correct, because their text editor shows it correctly. But feed it to any program that reads according to the declared encoding and that byte is now the first byte of a 2 byte sequence. You either get a different character or, if the second byte is not a legal value for that first byte, an error.
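Both outcomes can be reproduced in a few lines. This sketch assumes the editor saved with the Windows-1252 code page and the reader expects UTF-8; the sample strings are mine.

```python
# An editor using the Windows-1252 code page saves ß as the single byte 0xDF.
saved = "Straße".encode("cp1252")

# A UTF-8 reader sees 0xDF as the lead byte of a 2 byte sequence,
# and the "e" that follows is not a legal second byte.
try:
    saved.decode("utf-8")
    failed = False
except UnicodeDecodeError:
    failed = True

# If the next byte happens to be a legal second byte, there is no error,
# just one different character where the editor showed two.
silent = "ß¿".encode("cp1252").decode("utf-8")

print(failed, len(silent), f"U+{ord(silent):04X}")
```

The error case is the lucky one, because at least something tells you the file is wrong.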

Point 2 – Always create HTML and XML in a program that writes it out correctly using the declared encoding. If you must create it with a text editor, then view the final file in a browser.

Now, what about when the code you are writing will read or write a file? We are not talking about binary/data files where you write it out in your own format, but files that are considered text files. Java, .NET, etc. all have character encoders. The purpose of these encoders is to translate between a sequence of bytes (the file) and the characters they represent. Let's take what is actually a very difficult example: your source code, be it C#, Java, etc. These are still by and large "plain old text files" with no encoding hints. So how do programs handle them? Many assume they use the local code page. Many others assume that all characters will be in the range 0 – 127 and will choke on anything else.

Here's a key point about these text files – every program is still using an encoding. It may not be setting it in code, but by definition an encoding is being used.

Point 3 – Always set the encoding when you read and write text files. Not just for HTML & XML, but even for files like source code. It's fine if you set it to use the default codepage, but set the encoding.
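In practice this is one extra argument. The sketch below assumes Python's built-in `open`; the file name, the temporary directory, and the sample text are placeholders of mine.

```python
import locale
import os
import tempfile

text = "Größe: 5 µm"

# A temporary file stands in for whatever text file you read or write.
path = os.path.join(tempfile.mkdtemp(), "notes.txt")

# Name the encoding on both sides instead of relying on the platform default.
with open(path, "w", encoding="utf-8") as f:
    f.write(text)

with open(path, "r", encoding="utf-8") as f:
    round_trip = f.read()

# Using the default code page is fine too, as long as the choice is explicit.
default_cp = locale.getpreferredencoding(False)
print(round_trip, default_cp)
```

Passing `encoding=default_cp` would behave the same as leaving the argument out, but the next person reading the code can see that it was a decision and not an accident.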

Point 4 – Use the most complete encoder possible. You can write your own XML as a text file encoded for UTF-8. But if you write it using an XML encoder, then it will include the encoding in the metadata and you can't get it wrong. (Depending on the encoding, it may also add the byte order mark, the endian preamble, to the file.)
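As a sketch of what "most complete encoder" buys you, here is the same idea using Python's standard `xml.etree.ElementTree` module. The element name and text are made up, and ISO-8859-1 is chosen only so the declaration is visibly doing some work.

```python
import xml.etree.ElementTree as ET

root = ET.Element("street")
root.text = "Hauptstraße"

# Serializing through the XML library writes the declaration for us.
data = ET.tostring(root, encoding="ISO-8859-1", xml_declaration=True)

# A parser reads the declaration and decodes the bytes accordingly.
parsed = ET.fromstring(data)

print(data)
print(parsed.text)
```

The writer and the reader never have to agree on an encoding out of band, because the file itself carries the answer.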

OK, you're reading & writing files correctly, but what about inside your code? This is where it's easy: Unicode. That's what those encoders in the Java & .NET runtimes are designed to do. You read in and get Unicode. You write Unicode and get an encoded file. That's why the char type is 16 bits and is a unique core type that is for characters. This you probably have right because languages today don't give you much choice in the matter.
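The same decode-at-the-edge pattern looks like this in Python, whose `str` type plays the role the Unicode string types play in Java and .NET. The Greek sample text and the choice of encodings are assumptions of mine.

```python
# Bytes as they might arrive from a Greek (ISO 8859-7) file.
incoming = "Καλημέρα".encode("iso8859_7")

# Decode once at the boundary; inside the program it is just Unicode text.
text = incoming.decode("iso8859_7")
shouted = text.upper()

# Encode once on the way out, in whatever encoding the output needs.
outgoing = shouted.encode("utf-8")

print(len(incoming), len(text), len(outgoing))
```

Everything between the decode and the encode works on characters, so string operations like `upper()` never have to know what the file looked like on disk.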

Point 5 – (For developers on languages that have been around a while) – Always use Unicode internally. In C++ this is called wide chars (or something similar). Don't get clever to save a couple of bytes; memory is cheap and you have more important things to do.

Wrapping it up

I think there are two key items to keep in mind here. First, make sure you are taking the encoding into account on text files. Second, this is actually all very easy and straightforward. People rarely screw up how to use an encoding; it's when they ignore the issue that they get into trouble.

Written by:

David Thielen

President/CEO at Windward Studios
