# UTF8-CPP: UTF-8 with C++ in a Portable Way
## Introduction
C++ developers lack an easy and portable way of handling Unicode encoded strings. The original C++ Standard (known as C++98 or C++03) is Unicode agnostic. C++11 provides some support for Unicode at the core language and library level: u8, u, and U character and string literals, char16_t and char32_t character types, u16string and u32string library classes, and codecvt support for conversions between Unicode encoding forms. In the meantime, developers use third party libraries like ICU, OS specific capabilities, or simply roll their own solutions.
In order to easily handle UTF-8 encoded Unicode strings, I came up with a small, C++98 compatible generic library. For anybody used to working with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose - check out the [license](./LICENSE). The library has been used a lot in the past ten years both in commercial and open-source projects and is considered feature-complete now. If you run into bugs or performance issues, please let me know and I'll do my best to address them.
The purpose of this article is not to offer an introduction to Unicode in general, or to UTF-8 in particular. If you are not familiar with Unicode, be sure to check out the [Unicode Home Page](http://www.unicode.org/) or some other source of information on Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it.
## Examples of use
### Introductory Sample
To illustrate the use of the library, let's start with a small but complete program that opens a file containing UTF-8 encoded text, reads it line by line, checks each line for invalid UTF-8 byte sequences, and converts it to UTF-16 encoding and back to UTF-8:
```cpp
#include <fstream>
#include <iostream>
#include <iterator> // back_inserter
#include <string>
#include <vector>
#include "utf8.h"
using namespace std;
int main(int argc, char** argv)
{
    if (argc != 2) {
        cout << "\nUsage: docsample filename\n";
        return 0;
    }
    const char* test_file_path = argv[1];
    // Open the test file (must be UTF-8 encoded)
    ifstream fs8(test_file_path);
    if (!fs8.is_open()) {
        cout << "Could not open " << test_file_path << endl;
        return 0;
    }
    unsigned line_count = 1;
    string line;
    // Play with all the lines in the file
    while (getline(fs8, line)) {
        // check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
#if __cplusplus >= 201103L // C++ 11 or later
        auto end_it = utf8::find_invalid(line.begin(), line.end());
#else
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
#endif // C++ 11
        if (end_it != line.end()) {
            cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
            cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
        }
        // Get the line length (at least for the valid part)
        int length = utf8::distance(line.begin(), end_it);
        cout << "Length of line " << line_count << " is " << length << "\n";
        // Convert it to utf-16
#if __cplusplus >= 201103L // C++ 11 or later
        u16string utf16line = utf8::utf8to16(line);
#else
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
#endif // C++ 11
        // And back to utf-8;
#if __cplusplus >= 201103L // C++ 11 or later
        string utf8line = utf8::utf16to8(utf16line);
#else
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
#endif // C++ 11
        // Confirm that the conversion went OK:
        if (utf8line != string(line.begin(), end_it))
            cout << "Error in UTF-16 conversion at line: " << line_count << "\n";
        line_count++;
    }
    return 0;
}
```
In the previous code sample, we checked each line for invalid UTF-8 sequences with `find_invalid`; we determined the number of characters in each line with `utf8::distance` (more precisely, the number of Unicode code points, including a BOM if there is one and any line-ending character that `getline` leaves in the string); finally, we converted each line to UTF-16 encoding with `utf8to16` and back to UTF-8 with `utf16to8`.
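To make the difference between bytes and code points concrete, here is a minimal standalone sketch. It is not how the library implements `utf8::distance`, and `count_code_points` is a hypothetical helper name used only for illustration. It assumes the input is already valid UTF-8 and skips validation entirely, counting every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration only; not part of utf8.h.
// Assumes `s` is valid UTF-8: every code point then starts with exactly one
// non-continuation byte, so counting those bytes counts code points.
std::size_t count_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        unsigned char byte = static_cast<unsigned char>(s[i]);
        if ((byte & 0xC0) != 0x80) // not of the form 10xxxxxx
            ++count;
    }
    return count;
}
```

Under these assumptions, a line that starts with a three-byte BOM followed by two ASCII letters occupies five bytes but counts as three code points, which is why the reported length can differ from `line.size()`.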
Note the different usage pattern for old compilers. For instance, this is how we convert
a UTF-8 encoded string to a UTF-16 encoded one with a pre-C++11 compiler:
```cpp
vector<unsigned short> utf16line;
utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
```
With a more modern compiler, the same operation would look like:
```cpp
u16string utf16line = utf8::utf8to16(line);
```
If the `__cplusplus` macro indicates C++11 or later, the library exposes an API that takes into
account C++ standard Unicode strings and move semantics. With an older compiler, it is still
possible to use the same functionality, just in a slightly less convenient way.
In case you do not trust the `__cplusplus` macro or, for instance, do not want to include
the C++11 helper functions even with a modern compiler, define the `UTF_CPP_CPLUSPLUS` macro
before including `utf8.h` and assign it a value for the standard you want to use - the values are the same as for the `__cplusplus` macro. This can also be useful with compilers that are conservative in setting the `__cplusplus` macro even if they have good support for a recent standard edition - Microsoft's Visual C++ is one example.
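As a short illustration, the following fragment requests the C++98-style API regardless of what the compiler reports. The value `199711L` is the `__cplusplus` value that the standard assigns to C++98; picking that particular value here is only an example.

```cpp
// Must appear before the first inclusion of utf8.h in the translation unit.
// 199711L is the standard __cplusplus value for C++98/C++03.
#define UTF_CPP_CPLUSPLUS 199711L
#include "utf8.h"
```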
### Checking if a file contains valid UTF-8 text
Here is a function that checks whether the content of a file is valid UTF-8 encoded text without reading the content into memory:
```cpp
bool valid_utf8_file(const char* file_name)
{
    ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here
    istreambuf_iterator<char> it(ifs.rdbuf());
    istreambuf_iterator<char> eos;
    return utf8::is_valid(it, eos);
}
```
Because `utf8::is_valid()` works with input iterators, we were able to pass an `istreambuf_iterator` to it and validate the content of the file directly, without loading it into memory first.
Note that other functions that take input iterator arguments can be used in a similar way. For instance, to read the content of a UTF-8 encoded text file and convert the text to UTF-16, just do something like:
```cpp
u16string utf16text; // destination for the converted text
utf8::utf8to16(it, eos, back_inserter(utf16text));
```
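Input iterators are enough for this kind of work because the bytes can be examined in a single forward pass. The sketch below illustrates that idea with a deliberately simplified check. It is not `utf8::is_valid`, and the names `looks_like_utf8` and `stream_looks_like_utf8` are hypothetical. It only verifies that each lead byte is followed by the right number of continuation bytes; a full validator must also reject overlong forms, surrogates, and code points above U+10FFFF.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Hypothetical, simplified stand-in for a validity check; NOT utf8::is_valid.
// Each byte is read exactly once, so a single-pass input iterator suffices.
template <typename InputIt>
bool looks_like_utf8(InputIt it, InputIt end)
{
    while (it != end) {
        unsigned char lead = static_cast<unsigned char>(*it);
        ++it;
        int trailing;
        if (lead < 0x80)                trailing = 0; // 0xxxxxxx
        else if ((lead & 0xE0) == 0xC0) trailing = 1; // 110xxxxx
        else if ((lead & 0xF0) == 0xE0) trailing = 2; // 1110xxxx
        else if ((lead & 0xF8) == 0xF0) trailing = 3; // 11110xxx
        else return false; // stray continuation byte or invalid lead byte
        for (; trailing > 0; --trailing) {
            if (it == end) return false;             // truncated sequence
            unsigned char next = static_cast<unsigned char>(*it);
            if ((next & 0xC0) != 0x80) return false; // not 10xxxxxx
            ++it;
        }
    }
    return true;
}

// Runs the check over a stream without first copying it into a string.
bool stream_looks_like_utf8(std::istream& in)
{
    std::istreambuf_iterator<char> it(in.rdbuf());
    std::istreambuf_iterator<char> eos;
    return looks_like_utf8(it, eos);
}
```

The same template accepts `std::string` iterators as well, which mirrors how the library's iterator-based functions can be used with either in-memory strings or streams.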
### Ensure that a string contains valid UTF-8 text
If we have some text that "probably" contains UTF-8 encoded text and we want to replace any invalid UTF-8 sequence with a replacement character, something like the following function may be used:
```cpp
void fix_utf8_string(std::string& str)
{
    std::string temp;
    utf8::replace_invalid(str.begin(), str.end(), back_inserter(temp));
    str = temp;
}
```
The function will replace any invalid UTF-8 sequence with a Unicode replacement character. There is an overloaded function that enables the caller to supply their own replacement character.
## Points of interest
#### Design goals and decisions
The library was designed to be:
1. Generic: for better or worse, there are many C++ string classes out there, and the library should work with as many of them as possible.
2. Portable: the library should be portable across both platforms and compilers. The only non-portable code is a small section that declares unsigned integers of different sizes: three typedefs. They can be changed by the users of the library if they don't match their platform. The default setting should work for Windows (both 32-bit and 64-bit), and most 32-bit and 64-bit Unix derivatives. Support for post-C++03 language features is included for modern compilers at the API level only, so the library should work even with pretty old compilers.
3. Lightweight: follow the "pay only for w