Instagram engineer delves into emoji madness
Last month, the Instagram photo-sharing service started recognizing emojis in its hashtag searches, making the company the first major social networking service to offer this capability. A user could affix a sprightly emoji to a photo hashtag so the snap could be found by other users searching for that emoji. The Internet rejoiced.
Now, one of the Instagram engineers responsible for this technical feat has shared the company's approach in a blog item posted Wednesday that should be perused by any developer looking to outfit a social Internet service or consumer app with similar emoji goodness. Turns out that supporting the little digital icons is no easy task.
"Identifying characters can be difficult across programming languages. Only by parsing the standard, finding character variations and understanding language differences do they become possible to support," Instagram engineer Piyush Mangalick wrote in the new post.
While elders may bemoan emojis' putative deleterious effect on language, one thing is for sure: The youth love them. Today, almost 60 percent of user text generated on Instagram contains emojis. Among Instagram's 300 million users, emojis are now more widely used than acronyms. LOL.
First popularized in Japan during the last decade, emojis convey a wide range of subjects and emotions through the use of simple symbols and pictographs, usually fitted on a 12-by-12-pixel grid. They are often used as shorthand to eliminate the laborious typing of words on small devices. The Unicode standard for encoding the world's languages on computers adopted a set of 1,282 emojis in 2010, which paved the way for their widespread use on Apple and Android devices.
Including emojis in Instagram's hashtag index at first seemed like a simple task. With Unicode, each character -- be it a letter, symbol or emoji -- is represented by a string of hexadecimal numbers, which a programming language or operating system can translate into the appropriate character by using the Unicode guide.
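To make that idea concrete, here is a minimal sketch in Python 3 (not Instagram's code) that prints the hexadecimal number and official Unicode name behind a few characters. The helper name `describe` and the sample characters are illustrative choices, not anything taken from the Instagram post.

```python
# Illustrative sketch: look up the Unicode number (code point) behind a character.
import unicodedata


def describe(ch):
    """Return a string such as 'U+0041 LATIN CAPITAL LETTER A' for one character."""
    # ord() gives the character's code point; format it as at least four hex digits.
    return "U+{:04X} {}".format(ord(ch), unicodedata.name(ch, "<unnamed>"))


# A letter, a symbol and an emoji, written as escapes so the file stays ASCII.
samples = ["A", "\u2764", "\U0001F600"]

for ch in samples:
    print(describe(ch))
```

The letter needs only four hexadecimal digits, while the emoji needs five, which hints at why emojis are harder to handle than ordinary text.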
Unfortunately, creating a single way to search these raw Unicode strings across different platforms was not possible, Mangalick said. Emojis were typically handled in UTF-16, an encoding of Unicode that allows the numeric strings to be of differing lengths. That made them tricky to parse, given that different programming languages used different escape sequences, or markers, to delimit the numeric string. Additionally, some emojis required two strings of numbers.
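The differing lengths are easy to see in a short Python 3 sketch. It is an illustration rather than Instagram's method: the helper `utf16_units` is a hypothetical name, and the flag example is an assumption about what "two strings of numbers" can look like in practice.

```python
# Illustrative sketch: show how many 16-bit units UTF-16 needs for a piece of text.

def utf16_units(text):
    """Return the UTF-16 (big-endian) code units of text as hex strings."""
    data = text.encode("utf-16-be")  # two bytes per 16-bit unit, no byte-order mark
    return ["0x{:04X}".format(int.from_bytes(data[i:i + 2], "big"))
            for i in range(0, len(data), 2)]


print(utf16_units("A"))           # one unit: ['0x0041']
print(utf16_units("\U0001F600"))  # two units (a surrogate pair): ['0xD83D', '0xDE00']

# A flag is built from two separate code points, each of which needs two units.
japan_flag = "\U0001F1EF\U0001F1F5"
print(len(utf16_units(japan_flag)))  # 4
```

A search routine that assumes one fixed-size number per character would split such emojis in the middle, which is the parsing hazard the paragraph above describes.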
Apple muddied the waters further by offering users the ability to encode some emojis in various colors, which resulted in non-standard strings. Android also had a set of non-standard emoji encodings. For Instagram to use emojis correctly, an Android device had to recognize an iPhone emoji, and vice versa.
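The article does not spell out how the colors were encoded. One plausible reading, offered here as an assumption, is the skin-tone modifier characters (U+1F3FB through U+1F3FF, later standardized in Unicode 8.0) that follow a base emoji. Under that assumption, a service could fold colored variants back to one searchable form, roughly as in this Python 3 sketch; `strip_skin_tone` is a hypothetical helper, not a function from Instagram.

```python
# Illustrative sketch, assuming color variants are base emoji + a modifier character.

# The five skin-tone modifier code points, U+1F3FB through U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F400)}


def strip_skin_tone(text):
    """Drop skin-tone modifiers so every variant maps to the same base emoji."""
    return "".join(ch for ch in text if ch not in SKIN_TONE_MODIFIERS)


waving = "\U0001F44B"                  # waving hand, default color
waving_dark = "\U0001F44B\U0001F3FF"   # same hand followed by a modifier

print(len(waving), len(waving_dark))           # 1 2
print(strip_skin_tone(waving_dark) == waving)  # True
```

Whether to index the variants together or separately is a product decision; the sketch only shows that the two forms differ by one trailing character.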
For the solution, Instagram turned to regular expressions, a dense but extremely versatile language for searching for patterns in text. Regular expressions, called regex for short, were designed for tasks such as recognizing complex sets of data strings within larger, more complex strings of data.
In the IT world, regular-expression searches have justifiably gained a reputation for being fiendishly complicated. Instagram's regular expressions for finding emojis may be the most complicated yet.
The company painstakingly crafted a regex search pattern for Python 2.7, the company's preferred language for its back-end search service, that would identify all the possible emojis a user could use. The pattern was more than 3,600 characters long. Imagine entering that into Google without a single mistake.
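For a sense of the approach at a far smaller scale, the following Python 3 sketch matches hashtags made of ordinary word characters, emojis or both. It is not Instagram's pattern: the four emoji ranges are a hand-picked, incomplete assumption, and the names `EMOJI_RANGES`, `HASHTAG_RE` and `find_hashtags` are invented for this example.

```python
# Illustrative sketch: a tiny emoji-aware hashtag matcher (far from exhaustive).
import re

# A few well-known emoji blocks; a production pattern would need many more.
EMOJI_RANGES = (
    "\U0001F300-\U0001F5FF"  # miscellaneous symbols and pictographs
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F680-\U0001F6FF"  # transport and map symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
)

# '#' followed by one or more word characters or characters in the ranges above.
HASHTAG_RE = re.compile("#([\\w" + EMOJI_RANGES + "]+)")


def find_hashtags(caption):
    """Return the hashtag bodies (without '#') found in caption."""
    return HASHTAG_RE.findall(caption)


caption = "sunset #beach #\U0001F600 and #cat\U0001F431"
print(find_hashtags(caption))
```

Python 3 treats each emoji as a single character, so the ranges can be written directly. Python 2.7 on some builds exposes emojis as pairs of 16-bit units, which is one likely reason a complete pattern for that version grows so long.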
And that was just the regex for Python. Instagram had to identify emojis across all the platforms it supported. So company engineers had to craft separate, though equally voluminous, regex patterns for Java and Objective-C, the languages used on Google's and Apple's platforms, respectively.
The work paid off, however, not only in terms of the positive publicity that the emoji support generated for Instagram, but also by helping the company stay in touch with its digitally expressive user base. If emojis ever do surpass the use of text itself, as pundits fear and Instagram predicts, then Instagram is well poised for this colorful future.
Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab's e-mail address is Joab_Jackson@idg.com