# UTF8-CPP: UTF-8 with C++ in a Portable Way
## Introduction
C++ developers lack an easy and portable way of handling Unicode encoded strings. The original C++ Standard (known as C++98 or C++03) is Unicode agnostic. C++11 provides some support for Unicode at the core language and library level: u8, u, and U character and string literals, char16_t and char32_t character types, u16string and u32string library classes, and codecvt support for conversions between Unicode encoding forms. In the meantime, developers use third-party libraries like ICU, OS-specific capabilities, or simply roll their own solutions.
In order to easily handle UTF-8 encoded Unicode strings, I came up with a small, C++98 compatible generic library. For anybody used to working with STL algorithms and iterators, it should be easy and natural to use. The code is freely available for any purpose; check out the [license](./LICENSE). The library has been used a lot in the past ten years in both commercial and open-source projects and is now considered feature-complete. If you run into bugs or performance issues, please let me know and I'll do my best to address them.
The purpose of this article is not to offer an introduction to Unicode in general, or to UTF-8 in particular. If you are not familiar with Unicode, be sure to check out the [Unicode Home Page](http://www.unicode.org/) or some other source of information on Unicode. Also, it is not my aim to advocate the use of UTF-8 encoded strings in C++ programs; if you want to handle UTF-8 encoded strings from C++, I am sure you have good reasons for it.
## Examples of use
### Introductory Sample
To illustrate the use of the library, let's start with a small but complete program that opens a file containing UTF-8 encoded text, reads it line by line, checks each line for invalid UTF-8 byte sequences, and converts it to UTF-16 encoding and back to UTF-8:
```cpp
#include <fstream>
#include <iostream>
#include <iterator>   // back_inserter
#include <string>
#include <vector>
#include "utf8.h"
using namespace std;

int main(int argc, char** argv)
{
    if (argc != 2) {
        cout << "\nUsage: docsample filename\n";
        return 0;
    }
    const char* test_file_path = argv[1];
    // Open the test file (must be UTF-8 encoded)
    ifstream fs8(test_file_path);
    if (!fs8.is_open()) {
        cout << "Could not open " << test_file_path << endl;
        return 0;
    }

    unsigned line_count = 1;
    string line;
    // Play with all the lines in the file
    while (getline(fs8, line)) {
        // Check for invalid utf-8 (for a simple yes/no check, there is also utf8::is_valid function)
#if __cplusplus >= 201103L // C++ 11 or later
        auto end_it = utf8::find_invalid(line.begin(), line.end());
#else
        string::iterator end_it = utf8::find_invalid(line.begin(), line.end());
#endif // C++ 11
        if (end_it != line.end()) {
            cout << "Invalid UTF-8 encoding detected at line " << line_count << "\n";
            cout << "This part is fine: " << string(line.begin(), end_it) << "\n";
        }
        // Get the line length (at least for the valid part)
        int length = utf8::distance(line.begin(), end_it);
        cout << "Length of line " << line_count << " is " << length << "\n";

        // Convert it to utf-16 (only the valid part: the checked
        // conversion throws on invalid input)
#if __cplusplus >= 201103L // C++ 11 or later
        u16string utf16line = utf8::utf8to16(string(line.begin(), end_it));
#else
        vector<unsigned short> utf16line;
        utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
#endif // C++ 11
        // And back to utf-8
#if __cplusplus >= 201103L // C++ 11 or later
        string utf8line = utf8::utf16to8(utf16line);
#else
        string utf8line;
        utf8::utf16to8(utf16line.begin(), utf16line.end(), back_inserter(utf8line));
#endif // C++ 11
        // Confirm that the conversion went OK:
        if (utf8line != string(line.begin(), end_it))
            cout << "Error in UTF-16 conversion at line: " << line_count << "\n";

        line_count++;
    }

    return 0;
}
```
In the previous code sample, we checked each line for invalid UTF-8 sequences with `find_invalid`; we determined the number of characters in each line (more precisely, the number of Unicode code points, including the end of line and even a BOM if there is one) with `utf8::distance`; finally, we converted each line to UTF-16 encoding with `utf8to16` and back to UTF-8 with `utf16to8`.
Note the different usage pattern for old compilers. For instance, this is how we convert a UTF-8 encoded string to a UTF-16 encoded one with a pre-C++11 compiler:
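To make the distinction between bytes and code points concrete, here is a minimal standalone sketch that does not use the library at all. It assumes the input is already known to be valid UTF-8 and simply counts every byte that is not a continuation byte. The name `count_code_points` is hypothetical and this is not how `utf8::distance` is implemented; it only illustrates why the reported length can differ from `line.size()`.

```cpp
#include <cstddef>
#include <string>

// Illustrative only: counts code points in text that is ALREADY known to be
// valid UTF-8. Every code point has exactly one byte that is not a
// continuation byte (continuation bytes match the bit pattern 10xxxxxx).
// count_code_points is a hypothetical helper, not part of the library.
inline std::size_t count_code_points(const std::string& valid_utf8)
{
    std::size_t count = 0;
    for (std::string::size_type i = 0; i < valid_utf8.size(); ++i) {
        const unsigned char byte = static_cast<unsigned char>(valid_utf8[i]);
        if ((byte & 0xC0) != 0x80)   // lead byte or plain ASCII
            ++count;
    }
    return count;
}
```

For a line that starts with a BOM and ends with a newline, both are counted as one code point each, which matches the remark above about the end of line and the BOM.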
```cpp
vector<unsigned short> utf16line;
utf8::utf8to16(line.begin(), end_it, back_inserter(utf16line));
```
With a more modern compiler, the same operation would look like:
```cpp
u16string utf16line = utf8::utf8to16(line);
```
If the `__cplusplus` macro indicates C++11 or later, the library exposes an API that takes into account C++ standard Unicode strings and move semantics. With an older compiler, the same functionality is still available, just in a slightly less convenient form.
If you do not trust the `__cplusplus` macro or, for instance, do not want to include the C++11 helper functions even with a modern compiler, define the `UTF_CPP_CPLUSPLUS` macro before including `utf8.h` and assign it a value for the standard you want to use; the values are the same as for the `__cplusplus` macro. This can also be useful with compilers that are conservative in setting the `__cplusplus` macro even when they have good support for a recent standard edition; Microsoft's Visual C++ is one example.
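The override mechanism described above can be sketched without the library. The snippet below is a hypothetical stand-in for a header that gates part of its API on the language standard: `DEMO_CPLUSPLUS` and `demo_uses_cpp11_api` are invented names for illustration, and the real header may be organized differently.

```cpp
// Hypothetical stand-in for a header that selects its API by language
// standard. DEMO_CPLUSPLUS is an invented name for this sketch; a caller may
// define it before this point to override the compiler's own value.
#ifndef DEMO_CPLUSPLUS
#define DEMO_CPLUSPLUS __cplusplus   // fall back to what the compiler reports
#endif

// Reports which branch the preprocessor selected.
inline bool demo_uses_cpp11_api()
{
#if DEMO_CPLUSPLUS >= 201103L
    return true;    // C++11-style convenience overloads would be compiled in
#else
    return false;   // only the iterator-based C++98 interface would be visible
#endif
}
```

With this pattern, defining the macro as `199711L` before the header is seen forces the older branch even on a modern compiler, which is the effect the paragraph above describes for `UTF_CPP_CPLUSPLUS`.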
### Checking if a file contains valid UTF-8 text
Here is a function that checks whether the content of a file is valid UTF-8 encoded text without reading the content into memory:
```cpp
bool valid_utf8_file(const char* file_name)
{
    ifstream ifs(file_name);
    if (!ifs)
        return false; // even better, throw here

    istreambuf_iterator<char> it(ifs.rdbuf());
    istreambuf_iterator<char> eos;

    return utf8::is_valid(it, eos);
}
```
Because the function `utf8::is_valid()` works with input iterators, we were able to pass an `istreambuf_iterator` to it and read the content of the file directly, without loading it into memory first.
Note that other functions that take input iterator arguments can be used in a similar way. For instance, to read the content of a UTF-8 encoded text file and convert the text to UTF-16, just do something like:
```cpp
std::u16string utf16_text; // destination; requires C++11 for u16string
utf8::utf8to16(it, eos, back_inserter(utf16_text));
```
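To show why an input iterator is enough for this kind of check, here is a minimal standalone sketch of a single-pass validator. It is not the library's code: `sketch_is_valid_utf8` and `sketch_stream_is_valid_utf8` are hypothetical names, and the checks (continuation bytes, overlong forms, surrogates, the U+10FFFF limit) follow the general UTF-8 rules rather than any particular implementation.

```cpp
#include <istream>
#include <iterator>
#include <sstream>
#include <string>

// Illustrative single-pass UTF-8 validity check; NOT the library's code.
// It only dereferences and increments the iterator, so any input iterator
// (including std::istreambuf_iterator<char>) is sufficient.
template <typename InputIt>
bool sketch_is_valid_utf8(InputIt it, InputIt end)
{
    while (it != end) {
        const unsigned char lead = static_cast<unsigned char>(*it);
        ++it;
        unsigned long cp = 0;      // code point being assembled
        int trailing = 0;          // continuation bytes still expected
        unsigned long min_cp = 0;  // smallest value allowed for this length
        if (lead < 0x80)                { cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; trailing = 1; min_cp = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; trailing = 2; min_cp = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; trailing = 3; min_cp = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte

        for (int i = 0; i < trailing; ++i) {
            if (it == end) return false;               // truncated sequence
            const unsigned char next = static_cast<unsigned char>(*it);
            if ((next & 0xC0) != 0x80) return false;   // not a continuation byte
            cp = (cp << 6) | (next & 0x3F);
            ++it;
        }
        if (cp < min_cp) return false;                    // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
    }
    return true;
}

// Convenience wrapper that feeds a stream through the check above.
inline bool sketch_stream_is_valid_utf8(std::istream& in)
{
    std::istreambuf_iterator<char> it(in.rdbuf());
    std::istreambuf_iterator<char> eos;
    return sketch_is_valid_utf8(it, eos);
}
```

Because the sketch never looks back at a byte it has already consumed, the stream is read exactly once and nothing needs to be buffered, which is the property the `istreambuf_iterator` example relies on.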
### Ensure that a string contains valid UTF-8 text
If we have some text that "probably" contains UTF-8 encoded text and we want to replace any invalid UTF-8 sequence with a replacement character, something like the following function may be used:
```cpp
void fix_utf8_string(std::string& str)
{
    std::string temp;
    utf8::replace_invalid(str.begin(), str.end(), back_inserter(temp));
    str = temp;
}
```
The function will replace any invalid UTF-8 sequence with a Unicode replacement character. There is an overloaded function that enables the caller to supply their own replacement character.
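For reference, the Unicode replacement character is U+FFFD, which occupies three bytes in UTF-8. The following standalone sketch shows how a single code point maps to UTF-8 bytes, so you can see what ends up in the repaired string. `sketch_append_utf8` is a hypothetical helper written for this illustration, not a library function, and it assumes the caller passes a valid Unicode scalar value.

```cpp
#include <string>

// Illustrative encoder, not the library's implementation: appends the UTF-8
// form of one code point to out. The caller is assumed to pass a valid
// Unicode scalar value (no surrogates, at most 0x10FFFF).
inline void sketch_append_utf8(unsigned long cp, std::string& out)
{
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx then three 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Encoding `0xFFFD` with this sketch yields the bytes `EF BF BD`; a caller who prefers a different marker, such as `?` (U+003F), would get a single byte instead.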
## Points of interest
#### Design goals and decisions
The library was designed to be:
1. Generic: for better or worse, there are many C++ string classes out there, and the library should work with as many of them as possible.
2. Portable: the library should be portable across both platforms and compilers. The only non-portable code is a small section that declares unsigned integers of different sizes: three typedefs. They can be changed by the users of the library if they don't match their platform. The default setting should work for Windows (both 32 and 64 bit), and most 32 bit and 64 bit Unix derivatives. Support for post-C++03 language features is included for modern compilers at the API level only, so the library should work even with pretty old compilers.
3. Lightweight: follow the "pay only for w