String parsing

String parsing in C is painful and annoying.

I want to parse the components of an absolute POSIX path. Here’s the code to do it in several languages.

PHP:
<?php
foreach (explode(DIRECTORY_SEPARATOR, $path) as $component) {
    /* ... */
}
?>
Objective-C:
for (NSString *component in [path pathComponents]) {
    /* ... */
}
C using CoreFoundation:
CFArrayRef components = CFStringCreateArrayBySeparatingStrings(kCFAllocatorDefault, path, CFSTR("/"));

for (CFIndex i = 0; i < CFArrayGetCount(components); ++i) {
    CFStringRef component = CFArrayGetValueAtIndex(components, i);
    /* ... */
}
CFRelease(components);
Python:
import os

for component in path.split(os.sep):
    pass  # ...
bash:
save_IFS=$IFS
IFS=/
for component in $path; do
    :  # ...
done
IFS=$save_IFS

And finally…

C:
char *pos = NULL, *component = NULL, *dpath = strdup(path);

for (component = strtok_r(dpath, "/", &pos); component; component = strtok_r(NULL, "/", &pos)) {
    /* ... */
}
free(dpath);

/****************** or ******************/
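char *component = NULL, *dpath = strdup(path), *cursor = dpath;

/* strsep() advances cursor and leaves it NULL at the end, so keep dpath for free(). */
for (component = strsep(&cursor, "/"); component; component = strsep(&cursor, "/")) {
    /* ... */
}
free(dpath);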

These languages express the idea with varying degrees of readability and flexibility, but in my opinion none is more confusing or less flexible than the pure C way. strtok_r() modifies its argument, for mercy’s sake (as does strsep()), necessitating an annoying strdup() to avoid trampling on someone else’s data. Does this avoid a bunch more potentially costly memory allocations or an output size_t parameter? Yes. Does it make things any easier or clearer? No way.
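For comparison, here is a rough sketch of what the "output size_t parameter" alternative might look like. The name next_component() is made up for illustration and is not a standard function; it leaves the caller’s string untouched and hands back a pointer plus a length, with the same empty-component behavior as strsep().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of any standard library.
 * Returns a pointer to the next component inside the caller's string and
 * stores its length in *len. Returns NULL once the input is exhausted.
 * The input string is never modified, so no strdup() is needed. */
static const char *next_component(const char **cursor, size_t *len)
{
    assert(cursor != NULL && len != NULL);

    if (*cursor == NULL)
        return NULL;                 /* the previous call consumed the last component */

    const char *start = *cursor;
    *len = strcspn(start, "/");      /* bytes before the next '/' or the terminating NUL */

    /* Step past the separator, or mark the string as exhausted. */
    *cursor = (start[*len] == '/') ? start + *len + 1 : NULL;
    return start;
}
```

The price is that components are not NUL-terminated, so the caller has to carry the length around, e.g. printf("%.*s\n", (int)len, component). Whether that is any clearer than the strdup() dance is debatable.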

The runner-up for most confusing, for me, is the bash script version. The semantics of variable expansion in bash are impressively complicated, especially that IFS variable.

After that, of course, comes the CoreFoundation version, which is only better than plain C inasmuch as it uses clearer language.

To my annoyance, the winner in readability and usability is Python’s version, coming in at one line of code to split the string and loop over the components.

Edit: It’s deliberate that each of these snippets, save the Objective-C one, will yield an empty string for the leading / in an absolute path. There are times when treating the root directory as a separate entry in one’s loop is useful, at the filesystem level for example. It’s interesting to note that Objective-C is the only language among these which provides a dedicated method explicitly for splitting up the components of a path; all the others do it using a string operation and their best guess at the system’s path separator. Once again Python shines in that regard. Objective-C’s method is overkill, whereas Python has recognized that there are no systems out there pathological enough for the split(os.sep) paradigm not to work, and has not cluttered its library with an unnecessary function.

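Edit 2: In fact, due to the quirks of strtok_r(), the first C version shown above throws away the leading slash of an absolute path entirely: strtok_r() skips leading and repeated delimiters, so it never produces an empty token. Like the Objective-C version it yields no empty string, although -pathComponents at least reports the root as a separate "/" entry. To get the described behavior in C, one would use strsep() instead, as in the second C version. Also, a couple of the code snippets were wrong; I fixed them.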
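To make the difference concrete, here is a small sketch that counts what each function yields for the same path. The two function names are invented for this example, and the _DEFAULT_SOURCE define assumes glibc; strsep() is a BSD extension rather than POSIX, so it may not exist everywhere.

```c
#define _DEFAULT_SOURCE  /* assumption: glibc; exposes strsep() under strict -std= modes */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Counts the tokens strtok_r() yields. Leading and repeated slashes are
 * skipped, so empty components never show up.
 * Returns 0 if the copy could not be allocated (sketch-level error handling). */
static size_t count_with_strtok_r(const char *path)
{
    assert(path != NULL);

    char *copy = strdup(path), *pos = NULL;
    size_t n = 0;

    if (copy == NULL)
        return 0;
    for (char *c = strtok_r(copy, "/", &pos); c; c = strtok_r(NULL, "/", &pos))
        ++n;
    free(copy);
    return n;
}

/* Counts the fields strsep() yields. Every slash ends a field, so a leading
 * slash produces an empty first field. strsep() moves its pointer argument,
 * hence the separate cursor: copy must survive for free(). */
static size_t count_with_strsep(const char *path)
{
    assert(path != NULL);

    char *copy = strdup(path), *cursor = copy;
    size_t n = 0;

    if (copy == NULL)
        return 0;
    for (char *c = strsep(&cursor, "/"); c; c = strsep(&cursor, "/"))
        ++n;
    free(copy);
    return n;
}
```

For "/usr/bin", the strtok_r() count is 2 (usr, bin) and the strsep() count is 3 (an empty string, usr, bin).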
