Strip special chars for any language

November 23, 2014 at 5:29 pm

Many scripts use their own functions to produce the slugified part of a URL from a post, article or page title.
Such functions usually work well until they meet a less common language with its own special characters.
After searching the net I found one solution that really works; it has been tested on clients' sites that had exactly this problem (e.g. with Turkish). The function is short, and it handles these characters by transliterating them to ASCII instead of just dropping them.

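function slugify($text, $strict = true) {
    // decode HTML entities such as &amp; before processing
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');

    // replace every run of characters that are not letters, digits or dots with -
    $text = preg_replace('~[^\\pL\d.]+~u', '-', $text);

    // trim leading and trailing dashes
    $text = trim($text, '-');

    // transliterate to ASCII; the result of //TRANSLIT depends on LC_CTYPE
    setlocale(LC_CTYPE, 'en_GB.utf8');
    if (function_exists('iconv')) {
        $converted = iconv('utf-8', 'us-ascii//TRANSLIT', $text);
        // iconv returns false on failure; keep the previous text in that case
        if ($converted !== false) {
            $text = $converted;
        }
    }

    // lowercase
    $text = strtolower($text);

    // remove anything that is not a dash, a word character or a dot
    $text = preg_replace('~[^-\w.]+~', '', $text);

    // nothing usable left: return a placeholder
    // (strict comparison, because empty('0') is true and would reject the title "0")
    if ($text === '') {
        return 'empty_$';
    }

    // in strict mode dots are replaced with underscores
    if ($strict) {
        $text = str_replace(".", "_", $text);
    }

    return $text;
}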

With the function above it is now easy to convert any title into part of a URL:

$url_part = slugify($title);
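
The transliteration step only runs when iconv is available, and what //TRANSLIT produces depends on the platform and locale. As a rough sketch of an alternative for that one step, the helper below tries the intl extension first and falls back to iconv. Note that to_ascii() is a hypothetical name and is not part of the original function; it assumes UTF-8 input and simply returns the text unchanged when neither extension is installed.

```php
<?php
// Hypothetical helper (not part of the original slugify function):
// converts UTF-8 text to ASCII, preferring intl and falling back to iconv.
function to_ascii(string $text): string
{
    // intl: "Any-Latin" romanises other scripts,
    // "Latin-ASCII" maps accented Latin letters to ASCII equivalents.
    if (function_exists('transliterator_transliterate')) {
        $result = transliterator_transliterate('Any-Latin; Latin-ASCII', $text);
        if (is_string($result)) {
            return $result;
        }
    }

    // iconv: output depends on the iconv implementation and on LC_CTYPE.
    // The @ suppresses the notice iconv raises for characters it cannot convert.
    if (function_exists('iconv')) {
        $result = @iconv('UTF-8', 'US-ASCII//TRANSLIT', $text);
        if ($result !== false) {
            return $result;
        }
    }

    // Neither extension is available or both failed: return the input unchanged.
    return $text;
}
```

Inside slugify() this could stand in for the iconv block, e.g. $text = to_ascii($text); the later cleanup step still removes anything that is left outside the allowed characters.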