# Fast C++ CSV Parser
This is a small, easy-to-use and fast header-only library for reading comma separated value (CSV) files.
## Features
* Automatically rearranges columns by parsing the header line.
* Disk I/O and CSV-parsing are overlapped using threads for efficiency.
* Parsing features such as escaped strings can be enabled and disabled at compile time using templates. You only pay in speed for the features you actually use.
* Can read multiple GB files in reasonable time.
* Supports custom column separators (e.g. tab-separated value files), quote-escaped strings, and automatic space trimming.
* Works with `*`nix and Windows newlines and automatically ignores UTF-8 BOMs.
* Exception classes with enough context to format useful error messages. `what()` returns error messages ready to be shown to a user.
## Getting Started
The following small example should contain most of the syntax you need to use the library.
```cpp
#include "csv.h"
#include <string>

int main(){
  io::CSVReader<3> in("ram.csv");
  in.read_header(io::ignore_extra_column, "vendor", "size", "speed");
  std::string vendor; int size; double speed;
  while(in.read_row(vendor, size, speed)){
    // do stuff with the data
  }
}
```
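The header-driven column rearrangement listed under Features can be pictured as a lookup from the requested column names to their positions in the header line. The following sketch shows that idea in isolation. `column_order` is a hypothetical helper written for this example; it does not reflect how the library implements `read_header`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// For every requested column name, returns its index in the header line.
// Header columns that were not requested are simply never referenced.
// Throws std::runtime_error if a requested column is missing.
std::vector<std::size_t> column_order(
  const std::vector<std::string>&header,
  const std::vector<std::string>&wanted
){
  std::vector<std::size_t> order;
  for(const std::string&name : wanted){
    std::size_t i = 0;
    while(i < header.size() && header[i] != name)
      ++i;
    if(i == header.size())
      throw std::runtime_error("missing column: " + name);
    order.push_back(i);
  }
  assert(order.size() == wanted.size());
  return order;
}
```

With such a mapping, a file whose header reads `speed,vendor,notes,size` can still be consumed in the order `vendor`, `size`, `speed`, and the extra `notes` column is skipped.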
## Installation
The library only needs a standard-conformant C++11 compiler. It has no further dependencies. The library is completely contained inside a single header file, so it is sufficient to copy this file to some place on your include path. The library does not have to be explicitly built.
Note, however, that threads are used and some compilers (for example GCC) require you to link against additional libraries to make it work. With GCC it is important to add `-lpthread` as the last item when linking, i.e. the order in
```
g++ -std=c++0x a.o b.o -o prog -lpthread
```
is important. If you do not want to use threads for some reason, you can define `CSV_IO_NO_THREAD` before including the header.
Remember that the library makes use of C++11 features, so you have to enable support for them (e.g. add `-std=c++0x` or `-std=gnu++0x`).
The library was developed and tested with GCC 4.6.1.
Note that VS2013 is not C++11 compliant and will therefore not work out of the box. See [here](https://code.google.com/p/fast-cpp-csv-parser/issues/detail?id=6) for what needs to be adjusted to make the code work.
## Documentation
The library provides two classes:
* `LineReader`: A class to efficiently read large files line by line.
* `CSVReader`: A class that efficiently reads large CSV files.
Note that everything is contained in the `io` namespace.
### `LineReader`
```cpp
class LineReader{
public:
  // Constructors
  LineReader(some_string_type file_name);
  LineReader(some_string_type file_name, std::FILE*source);
  LineReader(some_string_type file_name, std::istream&source);
  LineReader(some_string_type file_name, std::unique_ptr<ByteSourceBase>source);

  // Reading
  char*next_line();

  // File Location
  void set_file_line(unsigned);
  unsigned get_file_line()const;
  void set_file_name(some_string_type file_name);
  const char*get_truncated_file_name()const;
};
```
The constructor takes a file name and optionally a data source. If no data source is provided, the constructor tries to open the file with the given name and throws an `error::can_not_open_file` exception on failure. If a data source is provided, the file name is only used to format error messages. In that case you can essentially put any string there. Using a string that describes the data source results in more informative error messages.
`some_string_type` can be a `std::string` or a `char*`. If the data source is a `std::FILE*`, the library will take care of calling `std::fclose`. If it is a `std::istream`, the stream is not closed by the library. For best performance, open the streams in binary mode. However, text mode also works. `ByteSourceBase` provides an interface that you can use to implement further data sources.
```cpp
class ByteSourceBase{
public:
  virtual int read(char*buffer, int size)=0;
  virtual ~ByteSourceBase(){}
};
```
The `read` function should fill the provided buffer with at most `size` bytes from the data source. It should return the number of bytes actually written to the buffer. If the data source has run out of bytes (for example because the end of the file was reached), the function should return 0. If a fatal error occurs, you can throw an exception. Note that the function can be called from both the main thread and the worker thread. However, it is guaranteed that they do not call the function at the same time.
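As an illustration of this contract, the following sketch implements an in-memory data source. The `ByteSourceBase` declaration is repeated only so that the snippet compiles on its own; in real code it comes from `csv.h` in the `io` namespace. `StringByteSource` is a hypothetical name and not part of the library.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <utility>

// Local copy of the interface shown above, so that this sketch is
// self-contained. With the library, use io::ByteSourceBase from csv.h.
class ByteSourceBase{
public:
  virtual int read(char*buffer, int size)=0;
  virtual ~ByteSourceBase(){}
};

// Hypothetical data source that serves bytes from a std::string.
class StringByteSource : public ByteSourceBase{
public:
  explicit StringByteSource(std::string data):
    data_(std::move(data)), pos_(0){}

  // Copies at most `size` bytes into `buffer` and returns the number of
  // bytes written. Returns 0 once all bytes have been consumed.
  int read(char*buffer, int size) override{
    assert(buffer != nullptr && size >= 0);
    const std::size_t remaining = data_.size() - pos_;
    const std::size_t count =
      std::min(remaining, static_cast<std::size_t>(size));
    std::memcpy(buffer, data_.data() + pos_, count);
    pos_ += count;
    return static_cast<int>(count);
  }

private:
  std::string data_;
  std::size_t pos_;
};
```

A class written this way, but derived from `io::ByteSourceBase` instead of the local copy, can be handed to the constructor overload that takes a `std::unique_ptr<ByteSourceBase>`.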
Lines are read by calling the `next_line` function. It returns a pointer to a null-terminated C-string that contains the line. If the end of the file is reached, a null pointer is returned. The newline character is not included in the string. You may modify the string as long as you do not write past the null terminator. The string stays valid until the destructor is called or until `next_line` is called again. Windows and `*`nix newlines are handled transparently. UTF-8 BOMs are automatically ignored, and a missing newline at the end of the file is not a problem.
**Important:** There is a limit of 2^24-1 characters per line. If this limit is exceeded, an `error::line_length_limit_exceeded` exception is thrown.
Looping over all the lines in a file can be done in the following way.
```cpp
LineReader in(...);
while(char*line = in.next_line()){
  ...
}
```
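The newline and BOM handling described for `next_line` can be pictured with a small standalone function. This is only a sketch of the observable behavior on an in-memory string, not the library's implementation, which works on buffers and returns one line at a time. `split_lines` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits `data` into lines the way the text above describes:
// a leading UTF-8 BOM is skipped, "\n" and "\r\n" both end a line,
// and a final line without a newline is still returned.
std::vector<std::string> split_lines(const std::string&data){
  std::size_t pos = 0;
  // UTF-8 byte order mark: EF BB BF
  if(data.compare(0, 3, "\xEF\xBB\xBF") == 0)
    pos = 3;
  std::vector<std::string> lines;
  while(pos < data.size()){
    std::size_t end = data.find('\n', pos);
    if(end == std::string::npos)
      end = data.size();           // last line has no newline
    std::size_t len = end - pos;
    if(len > 0 && data[end-1] == '\r')
      --len;                       // drop the '\r' of a Windows newline
    lines.push_back(data.substr(pos, len));
    pos = end + 1;
  }
  assert(pos >= data.size());
  return lines;
}
```

For input that starts with a BOM and mixes both newline styles, the function returns the bare line contents without any BOM or newline characters.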
The remaining functions are mainly used to format error messages. The file line indicates the current position in the file, i.e., after the first `next_line` call it is 1 and after the second 2. Before the first call it is 0. The file name is truncated because C-strings are used internally to avoid `std::bad_alloc` exceptions during error reporting.
**Note:** It is not possible to exchange the line termination character.
### `CSVReader`
`CSVReader` uses policies. These are classes with only static members to allow core functionality to be exchanged in an efficient way.
```cpp
template<
  unsigned column_count,
  class trim_policy = trim_chars<' ', '\t'>,
  class quote_policy = no_quote_escape<','>,
  class overflow_policy = throw_on_overflow,
  class comment_policy = no_comment
>
class CSVReader{
public:
  // Constructors
  // same as for LineReader

  // Parsing Header
  void read_header(ignore_column ignore_policy, some_string_type col_name1, some_string_type col_name2, ...);
  void set_header(some_string_type col_name1, some_string_type col_name2, ...);
  bool has_column(some_string_type col_name)const;

  // Read
  char*next_line();
  bool read_row(ColType1&col1, ColType2&col2, ...);

  // File Location
  void set_file_line(unsigned);
  unsigned get_file_line()const;
  void set_file_name(some_string_type file_name);
  const char*get_truncated_file_name()const;
};
```
The `column_count` template parameter indicates how many columns you want to read from the CSV file. This does not have to coincide with the actual number of columns in the file. The policies govern various aspects of the parsing.
The trim policy indicates which characters should be ignored at the beginning and the end of every column. The default ignores spaces and tabs. This makes sure that
```
a,b,c
1,2,3
```
is interpreted in the same way as
```
a, b, c
1 , 2, 3
```
The `trim_chars` policy can take any number of template parameters. For example, `trim_chars<' ', '\t', '_'>` is also valid. If no character should be trimmed, use `trim_chars<>`.
The quote policy indicates how strings should be escaped. It also specifies the column separator. The predefined policies are:
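To make the idea of a policy with only static members concrete, here is a minimal sketch of how a trimming policy of this shape could be written. `sketch_trim_chars`, its `trim` function, and the helper `trimmed` are illustrative names chosen for this example; the sketch is not the library's implementation.

```cpp
#include <cassert>
#include <string>

// Illustrative policy: a class template with only static members.
// The characters to trim are compile-time template parameters.
template<char ... trim_char_list>
struct sketch_trim_chars{
private:
  // Base case: no characters left to compare against.
  static bool is_trim_char(char){ return false; }

  // Recursive case: compare against the first character, then the rest.
  template<class ... Rest>
  static bool is_trim_char(char c, char first, Rest ... rest){
    return c == first || is_trim_char(c, rest...);
  }

public:
  // Narrows the range [begin, end) so that it neither starts nor ends
  // with one of the characters in trim_char_list.
  static void trim(const char*&begin, const char*&end){
    assert(begin <= end);
    while(begin != end && is_trim_char(*begin, trim_char_list...))
      ++begin;
    while(begin != end && is_trim_char(*(end-1), trim_char_list...))
      --end;
  }
};

// Helper for the example: applies a policy to a std::string.
template<class trim_policy>
std::string trimmed(const std::string&column){
  const char*begin = column.data();
  const char*end = begin + column.size();
  trim_policy::trim(begin, end);
  return std::string(begin, end);
}
```

Because the character list is fixed at compile time, the comparisons can be inlined, and an empty list such as `sketch_trim_chars<>` leaves the column untouched.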
* `no_quote_escape<sep>` : Strings are not escaped. "`sep`" is used as column separator.
* `double_quote_escape<sep, quote>` : Strings are escaped using quotes. Quotes are escaped using two consecutive quotes. "`sep`" is used as column separator and "`quote`" as quoting character.
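The quoting rule of `double_quote_escape` can be illustrated with a small standalone function. This is a sketch of the rule as described in the list above, not the library's code. `split_quoted_line` is a hypothetical helper, and its handling of malformed input (for example an unterminated quote) is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Splits `line` at `sep`. Columns may be enclosed in `quote` characters;
// inside a quoted column, `sep` is ordinary text and two consecutive
// quote characters stand for one literal quote.
std::vector<std::string> split_quoted_line(
  const std::string&line, char sep = ',', char quote = '"'
){
  assert(sep != quote);
  std::vector<std::string> columns(1);
  bool in_quotes = false;
  for(std::size_t i = 0; i < line.size(); ++i){
    const char c = line[i];
    if(in_quotes){
      if(c != quote){
        columns.back() += c;
      }else if(i + 1 < line.size() && line[i+1] == quote){
        columns.back() += quote;           // doubled quote: one literal quote
        ++i;
      }else{
        in_quotes = false;                 // closing quote
      }
    }else if(c == quote){
      in_quotes = true;                    // opening quote
    }else if(c == sep){
      columns.push_back(std::string());    // start the next column
    }else{
      columns.back() += c;
    }
  }
  return columns;
}
```

Passing `'\t'` as `sep` splits tab-separated lines in the same way.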
**Important**: When combining trimm