Regular Expression Word Boundaries

Discussion:

Scott Shipp

2014-01-23 18:12:14 UTC

I need some advice on how to do something better. Right now I do a search for a word in a string but only if it is its own word. For example the name suffix, "Jr." One regular expression for it is "\\bJr.\\b" but I find that this does not match when "Jr." is either at the beginning or the end of a line. So I end up making the regular expression "\\bJr.\\b|^Jr.\\b|\\bJr.$" to catch all cases. Basically, this is saying Jr. as a word between two other words OR Jr. at the start of a line before another word or Jr. at the end of a line after another word.

This seems clumsy to me. Is there a more elegant way to catch these cases?

Scott
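A minimal sketch that may explain the misses described above. It assumes the pattern is used with java.util.regex and Matcher.find(); the class name WordBoundaryDemo and the helper found() are made up for illustration. \b only holds between a word character and a non-word character (or the edge of the input), so a \b placed right after a period fails whenever the period is followed by a space or by the end of the line. The unescaped "." also matches any character, not just a period.

```java
import java.util.regex.Pattern;

public class WordBoundaryDemo {
    // Hypothetical helper: true if the regex matches anywhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String original = "\\bJr.\\b";

        // '.' is followed by end of input: no word character on either side, so \b fails.
        System.out.println(found(original, "John Smith Jr."));     // false
        // '.' is followed by a space: two non-word characters, so \b fails again.
        System.out.println(found(original, "John Smith Jr. was")); // false
        // The unescaped '.' matches 's', and \b holds between 's' and the space.
        System.out.println(found(original, "Jrs are here"));       // true
        // \b does hold at the very start of the input when the first character is a word character.
        System.out.println(found("\\bJr", "Jr. Smith"));           // true
    }
}
```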

Stewart Buskirk

2014-01-23 18:37:27 UTC

Try:

(\W|^)Jr\.(\W|$)

...or to include both "Jr" and "Jr.":

(\W|^)Jr\.?(\W|$)

...or to make it case-insensitive with an optional period:

(\W|^)[Jj][Rr]\.?(\W|$)

Fun test tool (not Java specific, though):

http://regexpal.com/

-Stewart
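For anyone trying this from Java, here is a rough sketch of the second pattern above written as a Java string literal (backslashes doubled). The class name SuffixAlternationDemo and the helper hasSuffix() are invented for the example. One thing worth noticing: (\W|^) and (\W|$) consume the neighboring character, so the matched text can include the surrounding spaces.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SuffixAlternationDemo {
    // (\W|^)Jr\.?(\W|$) as a Java string literal.
    static final Pattern SUFFIX = Pattern.compile("(\\W|^)Jr\\.?(\\W|$)");

    // Hypothetical helper: true if the suffix appears as its own word.
    static boolean hasSuffix(String input) {
        return SUFFIX.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(hasSuffix("Jr. Smith"));      // true  (start of line)
        System.out.println(hasSuffix("John Smith Jr.")); // true  (end of line)
        System.out.println(hasSuffix("John Smith Jr"));  // true  (period is optional)
        System.out.println(hasSuffix("Jrs"));            // false (part of a longer word)

        // The match includes the non-word characters on both sides.
        Matcher m = SUFFIX.matcher("Smith Jr. was");
        if (m.find()) {
            System.out.println("[" + m.group() + "]");   // [ Jr. ]
        }
    }
}
```

If only the suffix itself is needed, wrapping Jr\.? in its own capturing group is one way to pull it out without the neighbors.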

Konstantin Ignatyev

2014-01-23 18:39:48 UTC

Did you look at
JParsec https://github.com/abailly/jparsec/wiki/Overview
or ANTLR https://www.antlr.org/
?

Real parsers could be a bit easier to deal with than complex regex-es.

--
Konstantin Ignatyev

PS: If this is a typical day on planet Earth, humans will add fifteen
million tons of carbon to the atmosphere, destroy 115 square miles of
tropical rainforest, create seventy-two miles of desert, eliminate between
forty to one hundred species, erode seventy-one million tons of topsoil,
add 2,700 tons of CFCs to the stratosphere, and increase their population
by 263,000

Bowers, C.A. The Culture of Denial: Why the Environmental Movement Needs a
Strategy for Reforming Universities and Public Schools. New York: State
University of New York Press, 1997: (4) (5) (p.206)

Eric Jain

2014-01-23 19:14:15 UTC

Post by Scott Shipp
This seems clumsy to me. Is there a more elegant way to catch these cases?

"\\bJr\\.?\\b"
--
Eric Jain
zenobase.com -- What do you want to track today?
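A small sketch of how the pattern above behaves in Java, assuming Matcher.find() is used; the class name OptionalPeriodDemo and the helper firstMatch() are hypothetical. Because the trailing \b cannot hold after a period that is followed by a space or the end of the input, the optional period is given back in those cases and the match is just "Jr".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OptionalPeriodDemo {
    static final Pattern SUFFIX = Pattern.compile("\\bJr\\.?\\b");

    // Hypothetical helper: returns the first matched text, or null if there is no match.
    static String firstMatch(String input) {
        Matcher m = SUFFIX.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        // The regex backtracks: the period is dropped so \b can hold between 'r' and '.'.
        System.out.println(firstMatch("John Smith Jr."));  // Jr
        System.out.println(firstMatch("Jr. Smith"));       // Jr
        System.out.println(firstMatch("John Smith Jr"));   // Jr
        // Still rejects "Jr" embedded in a longer word.
        System.out.println(firstMatch("Jrs"));             // null
    }
}
```

That is fine for detecting the suffix; if the matched text is used for replacement, the period would be left behind.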

Scott Shipp

2014-01-23 22:59:36 UTC

I combined Eric Jain's and Stewart Buskirk's responses to get this regex, which is working well for my particular data: (?i)(\\b|^)jr\\.?(\\b|$)

Thanks guys!

My big takeaway: if you want either one of two boundary matchers to apply at a given spot, put them in parentheses with a pipe between them at that spot. I had actually attempted something like this previously with square brackets, which is why I had resorted to the longhand method I mentioned in the first post.
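Here is a quick sketch of the combined regex in use, assuming java.util.regex with Matcher.find(); the class name JrSuffixDemo and the helpers hasJr() and firstMatch() are invented for illustration. The (?i) flag makes the match case-insensitive, and the ($) alternative lets the period be kept when the suffix ends the line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class JrSuffixDemo {
    static final Pattern JR = Pattern.compile("(?i)(\\b|^)jr\\.?(\\b|$)");

    // Hypothetical helper: true if "Jr" / "Jr." appears as its own word.
    static boolean hasJr(String input) {
        return JR.matcher(input).find();
    }

    // Hypothetical helper: returns the first matched text, or null if none.
    static String firstMatch(String input) {
        Matcher m = JR.matcher(input);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(hasJr("John Smith Jr."));    // true
        System.out.println(hasJr("jr. smith"));         // true
        System.out.println(hasJr("JOHN SMITH JR"));     // true
        System.out.println(hasJr("Jrs"));               // false

        // At the end of the input, $ matches after the period, so it is kept.
        System.out.println(firstMatch("John Smith Jr.")); // Jr.
        // Mid-line, neither \b nor $ holds after the period, so it is dropped.
        System.out.println(firstMatch("Jr. Smith"));      // Jr
    }
}
```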

Thanks to everyone else who replied. I may get into the ANTLR / GATE / OpenNLP stuff but this does not seem necessary yet for the particular software I'm working on. Good to know what's out there, though!

Scott

Jason Osgood

2014-01-23 20:23:41 UTC

Guessing you're scraping names from plain text.

I've been doing (too much) screen scraping lately. The upside is learning regexes a lot better. But the work still sucks.

A current task is gathering the names of people who have testified, their organizations, and positions (for, against, neutral). Hoping to make my scrapers more robust, I've been reading about text extraction and playing with the tools. I don't know yet if they'll help me, but thought I'd mention it.

http://en.wikipedia.org/wiki/Named-entity_recognition

http://opennlp.apache.org

Cheers, Jason
