欢迎您访问程序员文章站本站旨在为大家提供分享程序员计算机编程知识!
您现在的位置是: 首页  >  数据库

Customized MySQL LOAD DATA LOCAL INFILE handlers with libmys_MySQL

程序员文章站 2024-01-01 21:32:52
...
One of the lesser known (and used) MySQL Client API calls seems to bemysql_set_local_infile_handler()which allows to override the defaultLOAD DATA LOCAL INFILEbehavior on the client side, allowing to import data in text form, e.g. in CSV format, from other sources than actual local files. I searched for some more detailed info than the reference page in the manual for this, or some example code using this mechanism, but couldn't really find any, so this post is going to try to close this gap.

This allows for things similar to PostgreSQL'sCOPY FROM STDINmechanism in a way, there is a substantial difference between the two DMBS's approaches here: withCOPY FROM STDINan application enters a sort of push mode where text data can be sent using specific library functions the mysql client library API implements a pull model instead where the client library takes control and requests application data via specific callbacks that need to be registered up front withmysql_set_local_infile_handler()before executing aLOAD DATA INFILEquery.

Setting things up

mysql_set_local_infile_handler()expects four callback function pointers, one for initialization, one for fetching a single line of text, one for cleaning up, and an error handler. It also expects a mysql connection handle as first parameter, and a pointer to user defined per-connection data that will be passed to the init function later.

This data pointer may point to data common to all invocations of the custom handler. In the example code I'm just using it to point to a descriptive name.

Call flow

The call flow for the various infile callbacks triggered while processing aLOAD DATA INFILEquery looks like this, with the upper part being the regular case:init()being called once, thenread()repeatedly until it can't provide any more data, and finally theend()callback. The lower paths show the error handling case that is triggered byinit()returning a non-zero value, or byread()returning a negative lenght to indicate an error. In this case theerror()callback is called, followed by a call toend().

+------ 0	|+--------+|+--------+ |+-------+| init() |-+- == 0 -+->| read() |-+- == 0 ---------->| end() |+--------+ | +--------+ | ^+-------+		|!= 0 ---------+-->| error() |-+	+---------+

init()

int local_infile_init(void **instance_data, const char *filename, void *handler_data)

The init function is called first whenever aLOAD DATA LOCALquery is issued. It receives a pointer-pointer where it can store the pointer to local state data to be used for thisLOADoperation, the filename used in theLOADstatement, and the per-connection user data pointer that was passed intomysql_set_local_infile_handler()earlier.

In the example code below I'm keeping track of line numbers via the user data pointer.

read()

int local_infile_read(void *instance_data, char *buf, unsigned int buf_len)

Theread()function gets called repeatedly until no more data is available. It receives the instance data pointer you stored ininit(), a pointer to a buffer to storeINFILE data into, and the max. amount of bytes that can be put into this buffer.

You don't have to pass a complete single line of data at a time, you can pass multiple input lines, or just part of a line, you just have to make sure not to exceed the buffer size. Data passed from theread()handler will simply be transfered over to the server, and all parsing, including splitting it into lines, happens on that side.

Theread()handler shall return the number of bytes that have been put into the buffer. A value of zero indicates that all data has been read and thatread()should not be called again. A negative value indicates an error and will also terminate reading. There is no way to generate warnings here, just errors.

error()

int local_infile_error(void *instance_data, char *error_msg, unsigned int error_msg_len)

Theerror()handler is called after returning a non-zero value frominit()or a negative number fromread(). It receives the data pointer you've set up ininit()and a buffer pointer plus length to write an error message to. A numeric error code can be passed as the return value.

There's no direct way to set error number and error code when hitting an error ininit()orread()right away. If you need to pass on something descriptive from where the error happened to theerror()handler you have to take care of doing so using the data pointer (or via global variables).

void local_infile_end(void *instance_data)

Theend()handler is called afterread()has returned a zero length to indicate "end of data", or right after the error handler. Its sole purpose is to free any resources that you may have allocated ininit().

Summary

While local infile handlers allow for some interesting alternatives to simple client side local file imports the current implementation feels a bit complicated and not like a good fit for applications that want to avoid the SQL parsing overhead on bulk imports, but want / need to drive the process instead of passing control to the client library.

To summarize this in an itemized list:

  • Pull model works for reading data from a different stream than a simple local file, but not so much for applications that just want to avoid the SQL parsing overhead (and its synchronous nature) on importing bulk data, but are a better match for a push model like the one the PostgreSQL PQ library is using
  • No insight into theLOAD DATAstatement or the related table / column meta data apart of the file name parameter
  • No flow control in the protocol beyond what TCP provides. There's especially no way to keep the connection alive if the input stream stalls for longer thannet_read_timeout
  • Error handling: I'd personally prefer the more direct way error handling is done in the UDF (User Defined Functions) API on the server side where all callbacks that may want to report an error can do so via an extra error message buffer passed as a parameter right away, but it is probably way too late to complain about this some ten years after the fact ...

Protocol wise the PostgreSQL and MySQL implementations are not that different though, so maybe an alternative call interface similar to thePQputCopyData()/PQputCopyEnd()approach inlibpqtolibmysqlclientcould be a nice addition to support both the push and pull approaches. ( ... to be continued ...)

Example

The following example code can be compiled using

gcc `mysql_config --cflags` infile_handler.c -o infile_handler `mysql_config --libs`

上一篇:

下一篇: