Creating New Scripts with StyleGAN

I applied StyleGAN to images of Unicode characters to see if it could invent new characters. I found some interesting results:

The world’s scripts

The world’s languages use about 400 different scripts for their writing systems. This includes Latin Script, which is the most widely used today:

The Unicode Consortium aims to map every character in the world to an underlying number so that they can be easily used across different computer systems. For example, hash “#” is mapped to the number 35, a-acute “á” is 225, and the Chinese character for fog “雾” is mapped to 38,654. The first ~65,000 characters in Unicode cover most scripts in current use and are divided into ~140 blocks, with Simple Latin being one of those blocks. See the Wikipedia Page on Unicode Blocks for the full list.

I served on the Unicode Consortium for a short time. I was appointed by the Linguistic Society of America as the alternate to focus on under-represented languages. I was fascinated to be a first hand witness the process of how scripts are formalized and encoded so that everyone in the world can take advantage of the information age, no matter how they choose to communicate. My time on the Unicode Consortium was at the time that emoji were first being added to Unicode, which was probably their most controversial decision. You can read my article about witnessing emoji being added to Unicode here:

One of the hardest decisions is what to name a given script in the Unicode standard. Simple Latin, as defined in Unicode, includes the common punctuations characters and the numbers. Obviously, there are more languages than Latin that use this script, and you could argue whether or not punctuation is part of the script or not. You would also be right to point at that all the numbers except 0 came from Arabic. So, the “blocks” in Unicode try to map to meaningful divisions in scripts, acknowledging that the boundaries can be fuzzy and that there will often be multiple names for scripts that might have political implications. So the names are convenient short-hand for the blocks in Unicode, but are not intended to be the primary or only name for people using that script for their language.

With that caveat, here are some interesting scripts that are in Unicode today, which I used as the basis for creating new ones with StyleGAN:

I acknowledge that in my examples above, I over-emphasized languages from South Asia. I’m writing this article after getting up early to watch India and Pakistan play in the Cricket World Cup, so I’m highlighting the diversity within their countries and probably within their teams. If you’re interested in language and sport, you should also see my analysis of names of the Football (Soccer) World Cup players here:

Data Preparation

I generated a JPG image for every single unicode character that could be rendered using the python Pillow library and the “Ariel Unicode” font that shipped with my MAC. If you want to recreate this, let me know and I’ll share my code.

UPDATE: code is here: https://github.com/rmunro/unicode_image_generator

I encoded the images in every block (every range of characters that correlates with one script) with a different color, so that I could easily see the biggest influences in the final set of characters. The colors in the images above reflect that: Latin is black, Tamil is bright green, etc.

Initially, this generated ~40,000 images. This is how many of the 65,000 characters had some sort of rendering by the Ariel Unicode font. It would be fun to experiment with a font that had greater coverage, especially across older scripts that are no longer used and other characters like emojis. It would also be interesting to generate characters using multiple different fonts. Let me know if you do this!

Of that 40,000, most ended up being Chinese and Japanese characters. I quickly abandoned one experiment where StyleGAN was only generating new characters that looked like Chinese and Japanese characters.

For every block of more than 256 characters, I randomly selected a subset of 256 characters. This took the data from 40,000 to about 7,000 characters. I used that 7,000 to train the model whose results I’m sharing in this article.

To see the first 20 characters in every block plus their colors, I’ve listed them on my personal website here:

StyleGAN

StyleGAN is an image-generating system implemented in TensorFlow that uses Generative Adversarial Networks (GANs). It was developed at NVidia and released as open-source code here:

For more depth on their approach, see this paper:

  • A Style-Based Generator Architecture for Generative Adversarial Networks. Tero Karras, Samuli Laine and Timo Aila. http://stylegan.xyz/paper

StyleGAN has most famously been used to create “realistic” looking photos of people that do not actually exist:

People have been using this to generate other fake images, and I was inspired by a few of these, including Miles Brundage’s use of StyleGAN to create new Battlestar Galactica images:

Experiment

With my stratified sample of 7,000 images, color-coded according to their Unicode Block, I ran styleGAN for exactly one week on a P2 AWS instance. I used the Deep Learning AMI and the only additional libraries I needed to install were for generating the images from fonts.

Again, if you’re interested, let me know and I’ll upload the code. It was a small modification to the styleGAN code. The hardest part was to get image and fonts libraries to play nice in python so that I could programmatically generate and color the images for every Unicode character.

Results

Ideally, the results should look like actual characters, but not literally look like any of the characters in Unicode today.

Here’s a selection of the real ones that the system trained on:

To begin with, the results weren’t very convincing. After 10 sample images (“ticks” in styleGAN’s system), they were indistinct blurs:

But after 30 ticks, we start to see some clear examples:

The examples after 30 ticks look realistic when zoomed out, but kind of alien when zoomed in where there isn’t a clear distinction between straight and curved lines:

Here again is the image at the start of this article, which is after 78 ticks, which now has some very clear examples:

The distinction between straight and curved lines is now clear and the accents and diacritics are now more distinct from the characters themselves. The resolution has doubled between 30 and 78 ticks in StyleGAN’s training, which also helps.

Here are some of my favorites at 78 ticks, with the colors telling us where they took their influence:

I’m really impressed with how realistic these characters seem! With only a couple of exceptions, look all like they could belong in the script of some language.

There’s a handful of examples that are already characters in Unicode. These could be examples where they weren’t part of the random selection of 256 for that block, or maybe they are offset or scaled in different ways. I haven’t looked into why these are erroneously present.

The training hadn’t completed by the 78th tick. I constrained this to exactly one week’s training mainly to fit my personal schedule. I’m sure it would produce even more interesting and convincing characters if it ran for longer.

It only took half a day to get the experiment running and a couple of hours for this analysis and write-up. Or to put it in Cricket World Cup terms, it took one innings to write the code and launch the process, and then 27 overs to analyze the results and write this blog post. If you’re also a coder who likes cricket, this is a great way to multitask while watching the World Cup!

There was only one way that the results fell short of my expectations: I hoped that some of the new characters would be rainbow-colored and show influence from multiple scripts at once. In reflection, I can see why this wasn’t the case: there were no multicolored examples in the training data and so multicolored examples would not be convincing adversarial examples.

Why generate new characters?

Beyond the fun factor, there are some practical use cases here:

  • Identify new characters for new scripts. Only half the world’s languages have adopted a script. It is often controversial for a language community to adopt the script of a former colonizer or invader. This method could provide a list of candidate characters that we already know don’t exist, but are stylistically consistent with scripts world-wide.
  • Understanding the visual properties of scripts. The fakes that are generated all tell us something interesting about the visual property of the script: the choice of curves vs lines, the distribution of information in different parts of that characters space, etc. So they tell us something interesting about how we encode information in similar or different ways across different scripts.
  • Creating new scripts for creative use cases. There are many interesting examples of fake scripts used in books and movies, from The Lord of the Rings to Star Trek. If you don’t have the budget to employ David J Peterson, this method can produce more realistic scripts than the random symbols that you sometimes see in low-budget sci-fiction films.

If you know of any other interesting applications of StyleGAN, or other methods for generating new characters, please let me know!

UPDATE: code is here: https://github.com/rmunro/unicode_image_generator

Private/Global Machine Learning at @Apple | Runs @BayAreaNLP | Wrote bit.ly/human-in-the-l… | Prev @StanfordNLP @AWSCloud | Opinions my own | 🚲🌍 | they/he

Sign up for The Variable

By Towards Data Science

Every Thursday, the Variable delivers the very best of Towards Data Science: from hands-on tutorials and cutting-edge research to original features you don't want to miss. Take a look.

Your home for data science. A Medium publication sharing concepts, ideas and codes.

The best way to understand your data is by taking the time to explore it.

Introduction

One of the most important skills that every Data Scientist must master is the ability to explore data properly. Thorough exploratory data analysis (EDA) is essential in order to ensure the integrity of your gathered data and performed analysis. The example used in this tutorial is an exploratory analysis…

Share your ideas with millions of readers.

Especially for an aspiring data scientist

A hackathon is a competition where various people, including programmers, developers and graphic designers, collaborate to design software projects intensively in a short period of times. The goal of a hackathon is to create a functional product at the end of the event.

Last month, I had an opportunity to…