UTF-8 with C++

UTF-8 is a variable-width character encoding that represents each code point with one to four 8-bit (1-byte) code units. UTF-16 is also variable-width and uses one or two 16-bit (2-byte) code units per code point. UTF-32 is fixed-width and uses exactly one 32-bit (4-byte) code unit for each code point.
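
As a rough illustration of these widths, the following sketch computes how many code units a single code point occupies in each encoding. The function names are made up for this example, and the input is assumed to be a valid Unicode code point.

```cpp
#include <cstddef>

// Hypothetical helpers, not part of the standard library.
// Each assumes cp is a valid Unicode code point (at most U+10FFFF, not a surrogate).

// UTF-8: one to four 8-bit code units.
constexpr std::size_t utf8_units(char32_t cp)
{
    return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
}

// UTF-16: one 16-bit code unit, or a surrogate pair above U+FFFF.
constexpr std::size_t utf16_units(char32_t cp)
{
    return cp < 0x10000 ? 1 : 2;
}

// UTF-32: always one 32-bit code unit.
constexpr std::size_t utf32_units(char32_t)
{
    return 1;
}

static_assert(utf8_units(U'Z') == 1, "ASCII letter");
static_assert(utf8_units(U'\u0417') == 2, "Cyrillic capital letter Ze");
static_assert(utf8_units(U'\u20AC') == 3, "euro sign");
static_assert(utf8_units(U'\U0001F600') == 4, "code point outside the BMP");
static_assert(utf16_units(U'\U0001F600') == 2, "surrogate pair");
static_assert(utf32_units(U'\U0001F600') == 1, "fixed width");
```

The checks run at compile time, so the snippet can be pasted into any translation unit.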

String representation

A string literal has type const char[N], where N counts the terminating null character. The string class has a constructor that accepts a string literal. Consequently, the size of a string literal includes the terminating null character, while the size() of a string excludes it.

string str = "Zdravo, Svete!";
const char s[] = "Zdravo, Svete!";
cout << str.size() << endl; // prints 14
cout << sizeof(s) / sizeof(*s) << endl; // prints 15
                

UTF-encoded strings can be declared as follows:

const char s[] = u8"Здраво, Свете!";// up to and including C++17
const string str = u8"Здраво, Свете!";// up to and including C++17
const char8_t u8s[] = u8"Здраво, Свете!";// since C++20
const u8string u8str = u8"Здраво, Свете!";// since C++20
const char16_t u16s[] = u"Здраво, Свете!";
const u16string u16str = u"Здраво, Свете!";
const char32_t u32s[] = U"Здраво, Свете!";
const u32string u32str = U"Здраво, Свете!";
                
The arrays s and u8s take two bytes for each of the eleven Cyrillic letters, one byte each for the comma, the space and the exclamation mark, and one byte for the terminating null character, which gives 26 bytes in total. The size of u16s is 30 bytes, since each of its 15 code units (14 characters plus the null terminator) takes two bytes. The size of u32s is 60 bytes, since each of its 15 code units takes four bytes.
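
These sizes can be verified at compile time. The sketch below assumes the source file is saved as UTF-8 and that the compiler reads it as such; code_units is a hypothetical helper introduced only for this example.

```cpp
#include <cstddef>

// Hypothetical helper: number of code units in a string literal,
// not counting the terminating null character.
template <typename CharT, std::size_t N>
constexpr std::size_t code_units(const CharT (&)[N])
{
    return N - 1;
}

// sizeof counts bytes and includes the terminating null character.
static_assert(sizeof(u8"Здраво, Свете!") == 26, "25 UTF-8 bytes + 1 null byte");
static_assert(sizeof(u"Здраво, Свете!") == 30, "(14 + 1) code units * 2 bytes");
static_assert(sizeof(U"Здраво, Свете!") == 60, "(14 + 1) code units * 4 bytes");

// The same text measured in code units instead of bytes.
static_assert(code_units(u8"Здраво, Свете!") == 25, "UTF-8 code units");
static_assert(code_units(u"Здраво, Свете!") == 14, "UTF-16 code units");
static_assert(code_units(U"Здраво, Свете!") == 14, "UTF-32 code units");
```

The u8 checks hold both before and after C++20, because char and char8_t are both one byte wide.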

Change in C++20

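C++20 introduces a breaking change for UTF-8 strings. A UTF-8 string literal now has type const char8_t[N] instead of const char[N], so it can no longer initialize a const char* or a string, only a const char8_t* or a u8string. This can be seen, for instance, with a function overloaded for both string and u8string: compiled as C++17 the call in main selects len(string), while compiled as C++20 it selects len(u8string).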

#include <iostream>
#include <string>


using namespace std;


int len(string s)
{
    cout << "len(string):" << endl;
    return s.size();
}


#if __cplusplus > 201703L
int len(u8string s)
{
    cout << "len(u8string):" << endl;
    return s.size();
}
#endif


int main()
{
    cout << len(u8"Здраво, Свете!") << endl;

    return 0;
}
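
The type change itself can also be checked without running anything. The following sketch inspects the element type of a u8 literal and uses the standard feature-test macro __cpp_char8_t rather than __cplusplus, which may be preferable where __cplusplus is not a reliable indicator. The names u8_element and u8_literal_is_char8_t are hypothetical.

```cpp
#include <type_traits>

// Element type of a UTF-8 string literal, with const and reference removed.
using u8_element = std::remove_cv_t<std::remove_reference_t<decltype(*u8"")>>;

// Hypothetical helper: true when u8 literals are made of char8_t.
constexpr bool u8_literal_is_char8_t()
{
    return !std::is_same<u8_element, char>::value;
}

#if defined(__cpp_char8_t)
static_assert(std::is_same<u8_element, char8_t>::value, "C++20 behaviour");
#else
static_assert(std::is_same<u8_element, char>::value, "pre-C++20 behaviour");
#endif
```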
                

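A char8_t string cannot be usefully written to a narrow stream such as ofstream. The C++20 standard declares the operator<< overloads for char8_t and const char8_t* as deleted, so a conforming library rejects the insertion at compile time; early implementations that predate this rule selected the const void* overload and wrote the pointer's address instead of the text. The string classes changed in a similar way: there is no operator<< that writes a u8string to an fstream. Ordinary char arrays and string objects are written as before.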

#include <fstream>
#include <string>


using namespace std;


int main()
{
    char s1[] = "Здраво, Свете!";
    string ss1 = "Здраво, Свете!";
    #if __cplusplus > 201703L
    char8_t s2[] = u8"Здраво, Свете!";
    u8string ss2 = u8"Здраво, Свете!";
    #else
    char s2[] = u8"Здраво, Свете!";
    string ss2 = u8"Здраво, Свете!";
    #endif

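    std::ofstream ofs("text.txt");
    ofs << s1 << endl;
    ofs << ss1 << endl;
    #if __cplusplus <= 201703L
    // Before C++20, s2 and ss2 are ordinary char strings and can be written.
    // In C++20, operator<< is deleted for char8_t strings and missing for
    // u8string, so these two lines would not compile.
    ofs << s2 << endl;
    ofs << ss2 << endl;
    #endif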

    return 0;
}
                

Conversion

Nothing in the string class enforces UTF-8 or any other encoding; a string is just a sequence of bytes. Thus, to switch between const char* and const char8_t* (and the corresponding string literals), reinterpret_cast may be used (according to the paper "char8_t backward compatibility remediation"):

const char* s = reinterpret_cast<const char*>(u8"Здраво, Свете!");
                
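and, in the opposite direction (keeping in mind that char8_t, unlike char, is not exempt from the aliasing rules, so the standard does not formally guarantee access through the resulting pointer):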
const char8_t* u8s = reinterpret_cast<const char8_t*>("Здраво, Свете!");
                

For the string classes, a u8string can be converted to a string and back by going through C strings:

u8string u8str1 = u8"Здраво, Свете!";
string s(reinterpret_cast<const char*>(u8str1.c_str()));
cout << s << endl;
u8string u8str2(reinterpret_cast<const char8_t*>(s.c_str()));
cout << boolalpha << (u8str1 == u8str2) << endl; // prints true
                
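In C++20 the standard streams do not accept a u8string directly, so converting it to a char-based string like this is the practical way to print the new UTF-8 strings.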
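
Constructing from c_str() stops at the first null character, so a string with embedded nulls would be truncated. A possible alternative, sketched here with hypothetical helper names and assuming a compiler with char8_t support, copies the code units through iterators, which also avoids the pointer cast.

```cpp
#include <string>

#if defined(__cpp_char8_t)
// Hypothetical helper: copies every code unit, including embedded nulls.
// Each char8_t value is converted to char individually, no pointer cast involved.
std::string to_narrow(const std::u8string& text)
{
    return std::string(text.begin(), text.end());
}

// Hypothetical helper for the opposite direction. The bytes are copied
// as they are; nothing checks that they form valid UTF-8.
std::u8string to_utf8(const std::string& text)
{
    return std::u8string(text.begin(), text.end());
}
#endif
```

For example, to_utf8(to_narrow(u8str1)) == u8str1 holds for any u8str1, with or without embedded nulls.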

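The C library also offers functions, declared in <cuchar>, for converting between multibyte strings in the current locale's encoding and UTF code units: c8rtomb, c16rtomb, c32rtomb, mbrtoc8, mbrtoc16 and mbrtoc32.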
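
As a small sketch of how one of these functions is used, the helper below (a hypothetical name) converts UTF-32 text to the multibyte encoding of the current C locale with c32rtomb. The result depends on the locale: in the default "C" locale typically only ASCII characters can be converted.

```cpp
#include <climits>  // MB_LEN_MAX
#include <cstddef>  // std::size_t
#include <cuchar>   // std::c32rtomb
#include <cwchar>   // std::mbstate_t
#include <string>

// Hypothetical helper: converts UTF-32 text to the multibyte encoding of
// the current C locale. Returns an empty string if any character cannot
// be represented in that encoding.
std::string to_multibyte(const std::u32string& text)
{
    std::mbstate_t state{};  // initial conversion state
    std::string out;
    char buf[MB_LEN_MAX];
    for (char32_t c : text) {
        const std::size_t n = std::c32rtomb(buf, c, &state);
        if (n == static_cast<std::size_t>(-1))
            return std::string();  // encoding error reported by c32rtomb
        out.append(buf, n);
    }
    return out;
}
```

To obtain UTF-8 output, a UTF-8 locale has to be selected first, for example with std::setlocale(LC_ALL, "en_US.UTF-8"), assuming a locale with that name is installed on the system.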

Printing the hex values of the code units of each string variant shows how the encodings differ.
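
A minimal sketch of such a hex dump follows. The function hex_units is a hypothetical helper; it works for string, u16string and u32string, and also for u8string where the compiler supports it.

```cpp
#include <iomanip>
#include <sstream>
#include <string>
#include <type_traits>

// Hypothetical helper: returns the code units of text as upper-case hex
// numbers, each padded to the width of the code unit and followed by a space.
template <typename CharT>
std::string hex_units(const std::basic_string<CharT>& text)
{
    // Go through the unsigned type of the same size, so that a negative
    // char such as '\xD0' is shown as D0 and not sign-extended.
    using Unsigned = typename std::make_unsigned<CharT>::type;

    std::ostringstream os;
    os << std::hex << std::uppercase << std::setfill('0');
    for (CharT c : text) {
        os << std::setw(2 * sizeof(CharT))
           << static_cast<unsigned long>(static_cast<Unsigned>(c)) << ' ';
    }
    return os.str();
}
```

Since template argument deduction does not convert a literal to a string object, the argument has to be a string already, for example hex_units(std::u16string(u"Здраво")).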

ready.

10 print "mail: contact at alepho.com | skype: karastojko | stackoverflow: karastojko | github: karastojko"
20 print "(c) 2009-2023 www.alepho.com"