Archive for December, 2010

Extracting strings from MySQL text columns

Today, I was working with a MySQL database that contained a column of text that I wanted to parse out into multiple other columns so I could query on them. The first thing that came to mind was why not use regular expressions to extract out the needed bits from the column, but unfortunately, MySQL does not support this use of regular expressions nor regular expression matching at all on InnoDB tables (which is what this was). I was able to accomplish this task using the following SQL:

To extract the first instance of a given piece of text between string start and string end in string content:

For example, start=”www.”, end=”.com”, content=”www.facebook.com links to www.google.com” and I want to select out “facebook”:

SUBSTRING( 
MAX( content ) , 
LOCATE(start,content)+LENGTH(start), 
LOCATE(end, content, LOCATE(start, content)+LENGTH(start)) - LOCATE(start, content) - LENGTH(start)
)

For example:

mysql> SELECT
    -> @content:="www.facebook.com links to www.google.com",
    -> @start:="www.",
    -> @finish:=".com",
    -> SUBSTRING( 
    -> @content , 
    -> LOCATE(@start, @content)+LENGTH(@start), 
    -> LOCATE(@finish, @content, LOCATE(@start, @content)+LENGTH(@start)) - LOCATE(@start, @content)- LENGTH(@start)
    -> ) AS extracted;
+------------------------------------------------------+----------------+-----------------+-----------+
| @content:="www.facebook.com links to www.google.com" | @start:="www." | @finish:=".com" | extracted |
+------------------------------------------------------+----------------+-----------------+-----------+
| www.facebook.com links to www.google.com             | www.           | .com            | facebook  | 
+------------------------------------------------------+----------------+-----------------+-----------+
1 row in set (0.00 sec)

And of course when creating a view, its always best to cast the columns to the proper types so they can be effectively queried, which is as easy as:

CAST( col AS type)

For example:

mysql> SELECT @text:="  0438t", CAST( @text AS DECIMAL);
+------------------+-------------------------+
| @text:="  0438t" | CAST( @text AS DECIMAL) |
+------------------+-------------------------+
|   0438t          |                     438 | 
+------------------+-------------------------+
1 row in set (0.00 sec)

I will be the first to admit this is not the most efficient use of the database, but it works (and I can cast the values) and it can easily be queried when used as a view.