Discussion:
Regular Expression Word Boundaries
Scott Shipp
2014-01-23 18:12:14 UTC
I need some advice on how to do something better. Right now I do a search for a word in a string but only if it is its own word. For example the name suffix, "Jr." One regular expression for it is "\\bJr.\\b" but I find that this does not match when "Jr." is either at the beginning or the end of a line. So I end up making the regular expression "\\bJr.\\b|^Jr.\\b|\\bJr.$" to catch all cases. Basically, this is saying Jr. as a word between two other words OR Jr. at the start of a line before another word or Jr. at the end of a line after another word.

This seems clumsy to me. Is there a more elegant way to catch these cases?

Scott
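
For reference, a minimal sketch of why the first pattern misses, assuming Java's java.util.regex and made-up sample strings (the class and method names are hypothetical). `\b` only matches between a word character and a non-word character, and a period is a non-word character, so there is no boundary between "Jr." and a following space or the end of the input:

```java
import java.util.regex.Pattern;

class BoundaryAfterPeriod {
    // The pattern from the question. The unescaped '.' matches any character.
    static final Pattern ORIGINAL = Pattern.compile("\\bJr.\\b");

    // True if the pattern matches anywhere in the input.
    static boolean found(String input) {
        return ORIGINAL.matcher(input).find();
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not taken from the thread.
        String[] samples = {"John Smith Jr.", "Jr. Smith", "Smith Jr. was here", "Jrs"};
        for (String s : samples) {
            System.out.println(s + " -> " + found(s));
        }
    }
}
```

With these inputs only "Jrs" is found: the unescaped '.' consumes the "s" and a boundary follows it. The three "Jr." inputs are not found, because the trailing `\b` would have to sit between the period and a space or the end of the input.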
Stewart Buskirk
2014-01-23 18:37:27 UTC
Try:

(\W|^)Jr\.(\W|$)

...or to include both "Jr" and "Jr.":

(\W|^)Jr\.?(\W|$)

...or to make case and period insensitive:

(\W|^)[Jj][Rr]\.?(\W|$)

Fun test tool (not Java specific, though):

http://regexpal.com/

-Stewart
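
A rough Java sketch of the `(\W|^)Jr\.?(\W|$)` suggestion, assuming java.util.regex; the class name, method names, and sample string are hypothetical. One thing to be aware of is that `\W` consumes the neighbouring character, so the whole match includes the delimiters. The sketch adds a capture group around the suffix to get at it alone, and also shows a lookaround variant, which is not from the thread, that checks the neighbours without consuming them:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonWordDelimiters {
    // The (\W|^)Jr\.?(\W|$) pattern from the reply, with one extra group
    // (group 2) around the suffix so it can be read without its delimiters.
    static final Pattern DELIMITED = Pattern.compile("(\\W|^)(Jr\\.?)(\\W|$)");

    // Alternative not from the thread: lookarounds assert "no word character
    // on either side" without consuming anything.
    static final Pattern LOOKAROUND = Pattern.compile("(?<!\\w)Jr\\.?(?!\\w)");

    // Whole match of DELIMITED, including any delimiter characters, or null.
    static String wholeMatch(String input) {
        Matcher m = DELIMITED.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Just the suffix captured by group 2 of DELIMITED, or null.
    static String suffix(String input) {
        Matcher m = DELIMITED.matcher(input);
        return m.find() ? m.group(2) : null;
    }

    // Whole match of LOOKAROUND, or null.
    static String lookaround(String input) {
        Matcher m = LOOKAROUND.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        String s = "Smith, Jr., Esq."; // hypothetical input
        System.out.println("[" + wholeMatch(s) + "]");
        System.out.println("[" + suffix(s) + "]");
        System.out.println("[" + lookaround(s) + "]");
    }
}
```

The consumed delimiters matter mostly for replacements and for repeated `find()` calls, where a delimiter swallowed by one match is no longer available to the next.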
Konstantin Ignatyev
2014-01-23 18:39:48 UTC
Did you look at
JParsec https://github.com/abailly/jparsec/wiki/Overview
or ANTLR http://www.antlr.org/
?

Real parsers could be a bit easier to deal with than complex regexes.
--
Konstantin Ignatyev
Eric Jain
2014-01-23 19:14:15 UTC
Post by Scott Shipp
This seems clumsy to me. Is there a more elegant way to catch these cases?
"\\bJr\\.?\\b"
--
Eric Jain
zenobase.com -- What do you want to track today?
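
A small sketch of how `\bJr\.?\b` behaves in Java, assuming java.util.regex; the class, helper names, and inputs are hypothetical. Because the period is optional, the regex engine can give it back and place the second `\b` between "r" and ".", so the pattern does find the suffix, but the period is usually left out of the matched text:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OptionalPeriodBoundary {
    // The suggested pattern: an escaped, optional period between two \b.
    static final Pattern SUGGESTED = Pattern.compile("\\bJr\\.?\\b");

    // Text actually matched, or null when nothing matches.
    static String matched(String input) {
        Matcher m = SUGGESTED.matcher(input);
        return m.find() ? m.group() : null;
    }

    // Removes every match; shows what is left behind.
    static String strip(String input) {
        return SUGGESTED.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not taken from the thread.
        System.out.println(matched("John Smith Jr."));
        System.out.println(matched("Jr.Smith"));
        System.out.println(strip("John Smith Jr."));
    }
}
```

This is fine for detecting the suffix. If the goal is to remove or extract "Jr." with its period, the period only ends up in the match when a word character follows it directly.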
Scott Shipp
2014-01-23 22:59:36 UTC
I combined Eric Jain's and Stewart Buskirk's responses to get this regex, which is working well for my particular data: (?i)(\\b|^)jr\\.?(\\b|$)

Thanks guys!

My big takeaway: if you want to match one boundary matcher or the other, put them in parentheses separated by a pipe at the spot you want to mark. I had actually attempted something like this previously with square brackets, which is why I had resorted to the longhand method I mentioned in the first post.

Thanks to everyone else who replied. I may get into the ANTLR / GATE / OpenNLP stuff but this does not seem necessary yet for the particular software I'm working on. Good to know what's out there, though!

Scott
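
The takeaway about parentheses versus square brackets can be sketched as follows, assuming java.util.regex. The thread does not show the square-bracket attempt, so the `BRACKETS` pattern below is a guess at what such an attempt might look like; the class name, helper, and inputs are likewise hypothetical. Inside `[...]` the characters `^` and `$` are literals rather than anchors, whereas inside `(...|...)` they keep their anchor meaning:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationVsBrackets {
    // The final pattern from the post above.
    static final Pattern FINAL = Pattern.compile("(?i)(\\b|^)jr\\.?(\\b|$)");

    // Hypothetical square-bracket attempt (an assumption, not shown in the
    // thread): here '^' and '$' are literal characters, not anchors.
    static final Pattern BRACKETS = Pattern.compile("[ ^]Jr\\.?[ $]");

    // Text matched by the given pattern, or null when nothing matches.
    static String matched(Pattern p, String input) {
        Matcher m = p.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // Hypothetical inputs, not taken from the thread.
        String[] samples = {"John Smith Jr.", "jr. smith", "Smith Jr. was", "Jrs"};
        for (String s : samples) {
            System.out.println(s + " -> final=[" + matched(FINAL, s)
                    + "] brackets=[" + matched(BRACKETS, s) + "]");
        }
    }
}
```

In this sketch the bracket version only matches when a real space sits on both sides, which reproduces the start-of-line and end-of-line misses from the first post. In the final pattern the `(\b|^)` group behaves like a plain `\b`, since `\b` already matches at the start of the input before a word character; the `$` alternative is what lets a trailing period at the end of the input be part of the match. A period followed by a space is still found but left out of the matched text.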
Jason Osgood
2014-01-23 20:23:41 UTC
I need some advice on how to do something better. Right now I do a search for a word in a string but only if it is its own word. For example the name suffix, "Jr." One regular expression for it is ...
This seems clumsy to me. Is there a more elegant way to catch these cases?
Guessing you’re scraping names from plain text.

I’ve been doing (too much) screen scraping lately. The upside is learning regexes a lot better. But the work still sucks.

A current task is gathering the names of people who have testified, their organizations, and positions (for, against, neutral). Hoping to make my scrapers more robust, I’ve been reading about text extraction and playing with the tools. I don’t know yet if they’ll help me, but thought I’d mention it.

http://en.wikipedia.org/wiki/Named-entity_recognition

http://opennlp.apache.org


Cheers, Jason