Discussion:
finding multiple patterns in log file
Ted Yu ted_yu-/E1597aS9LQAvxtiuMwx3w@public.gmane.org [seajug]
2014-05-14 18:26:05 UTC
Permalink
Hi,
Below is one sample pattern I want to search for in given log file:

static String splitterPattern = "^(.*) INFO  (.*) wal.HLogSplitter ...";
static Pattern SPLITTER = Pattern.compile(splitterPattern);

There are 10 (or more) such patterns.
I am currently iterating through the patterns for each log line.

Is there a faster way to do this?

Thanks
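
For reference, the per-line loop described above might look like the following minimal sketch. The class name, the method name, and both pattern strings are hypothetical stand-ins (the real patterns are elided in the post); the only point is that each Pattern is compiled once and then tried in turn against every line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PatternLoop {
    // Hypothetical patterns; the actual ones from the post are not shown in full.
    static final List<Pattern> PATTERNS = Arrays.asList(
        Pattern.compile("^(\\S+) INFO  (.*)$"),
        Pattern.compile("^(\\S+) WARN  (.*)$"));

    /** Returns the index of the first pattern that matches the whole line, or -1. */
    static int firstMatch(String line) {
        for (int i = 0; i < PATTERNS.size(); i++) {
            Matcher m = PATTERNS.get(i).matcher(line);
            if (m.matches()) {
                return i;
            }
        }
        return -1;
    }
}
```

The cost of this baseline grows with the number of patterns, since a line that matches nothing is tested against all of them.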
Scott Shipp scottashipp-1ViLX0X+lBJBDgjK7y7TUQ@public.gmane.org [seajug]
2014-05-14 18:38:23 UTC
Permalink
I'm not sure it's faster, but you can write code that joins all the patterns into one long pattern, similar to the code below. Time it and see whether it is an improvement in your case, or whether it even works for you. It uses "|" to separate the different patterns, so if any of your regular expressions already relies on a pipe, this will not work as expected.

In my case, this code was for finding name prefixes (Mr., Mrs., Sr. , Sra., etc.) and suffixes (Jr., Sr., II, III, etc.) and the resulting String representation of the pattern would look something like:

"(?: |,|^)(jr)(?: |,|$)|(?: |,|^)(jr.)(?: |,|$)|(?: |,|^)(sr.)(?: |,|$)|(?: |,|^)(sr)(?: |,|$)|(?: |,|^)(iii)(?: |,|$)|(?: |,|^)(iv)(?: |,|$)|"

I was also using the case insensitive flag.

---

import java.util.regex.Pattern;

final static String defaultBoundaryStart = "(?: |,|^)(";
final static String defaultBoundaryEnd = ")(?: |,|$)|";

// Joins the quoted strings into one alternation; each string gets its own capturing group.
Pattern createMatchPattern(String[] strings, String boundaryStart, String boundaryEnd) {
    StringBuilder pattern = new StringBuilder();
    for (String s : strings) {
        pattern.append(boundaryStart).append(Pattern.quote(s)).append(boundaryEnd);
    }
    // Drop the trailing "|" left by the last boundaryEnd (assumes strings is non-empty).
    pattern.deleteCharAt(pattern.length() - 1);
    return Pattern.compile(pattern.toString(), Pattern.CASE_INSENSITIVE);
}

Usage:

Pattern bigLongPattern = createMatchPattern(someArrayOfPatterns, defaultBoundaryStart, defaultBoundaryEnd);
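
A self-contained sketch of how the combined pattern could be used follows. The class name, the matchedIndex helper, and the suffix list are assumptions added for illustration; the construction itself repeats createMatchPattern from above. Because every input string is wrapped in its own capturing group, the first non-null group identifies which string matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class CombinedPatternDemo {
    static final String BOUNDARY_START = "(?: |,|^)(";
    static final String BOUNDARY_END = ")(?: |,|$)|";

    // Same construction as createMatchPattern above.
    static Pattern createMatchPattern(String[] strings, String start, String end) {
        StringBuilder sb = new StringBuilder();
        for (String s : strings) {
            sb.append(start).append(Pattern.quote(s)).append(end);
        }
        sb.deleteCharAt(sb.length() - 1); // remove the trailing "|"
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    /** Returns the index in the source array of the string that matched, or -1. */
    static int matchedIndex(Pattern p, String input) {
        Matcher m = p.matcher(input);
        if (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.group(g) != null) {
                    return g - 1; // group numbers are 1-based
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String[] suffixes = {"jr", "jr.", "sr", "sr.", "iii", "iv"}; // hypothetical list
        Pattern p = createMatchPattern(suffixes, BOUNDARY_START, BOUNDARY_END);
        System.out.println(matchedIndex(p, "John Smith, Jr."));
    }
}
```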

Eric Jain eric.jain-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org [seajug]
2014-05-14 19:19:48 UTC
Permalink
Post by Scott Shipp scottashipp-1ViLX0X+***@public.gmane.org [seajug]
"(?: |,|^)(jr)(?: |,|$)|(?: |,|^)(jr.)(?: |,|$)|(?: |,|^)(sr.)(?: |,|$)|(?: |,|^)(sr)(?: |,|$)|(?: |,|^)(iii)(?: |,|$)|(?: |,|^)(iv)(?: |,|$)|"
"(?: |,|^)([js]r.?|iii|iv)(?: |,|$)"

Could also just split on "[ ,]+", and then check a set of suffixes
(unless you really want '.' rather than '\.').
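
The split-and-lookup alternative could be sketched roughly as below. The class name, method name, and suffix set are hypothetical; the set is lower-cased so that the lookup is case-insensitive, and the literal "." entries are matched exactly rather than as a regex wildcard.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class SuffixSplit {
    // Hypothetical suffix list, lower-cased; taken from the examples in this thread.
    static final Set<String> SUFFIXES =
        new HashSet<>(Arrays.asList("jr", "jr.", "sr", "sr.", "iii", "iv"));

    /** Returns the first token that is a known suffix, or null if there is none. */
    static String findSuffix(String name) {
        for (String token : name.split("[ ,]+")) {
            if (SUFFIXES.contains(token.toLowerCase(Locale.ROOT))) {
                return token;
            }
        }
        return null;
    }
}
```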
--
Eric Jain
Got data? Get answers at zenobase.com.


------------------------------------

Yahoo Groups Links

<*> To visit your group on the web, go to:
http://groups.yahoo.com/group/seajug/

<*> Your email settings:
Individual Email | Traditional

<*> To change settings online go to:
http://groups.yahoo.com/group/seajug/join
(Yahoo! ID required)

<*> To change settings via email:
seajug-digest-***@public.gmane.org
seajug-fullfeatured-***@public.gmane.org

<*> To unsubscribe from this group, send an email to:
seajug-unsubscribe-***@public.gmane.org

<*> Your use of Yahoo Groups is subject to:
https://info.yahoo.com/legal/us/yahoo/utos/terms/
Scott Shipp scottashipp-1ViLX0X+lBJBDgjK7y7TUQ@public.gmane.org [seajug]
2014-05-14 19:26:18 UTC
Permalink
I like it! There are definitely better ways to write said reg-ex as a developer. I think code elsewhere did escape the "." (and other chars) at some point and use "\."
The idea behind this code was that there was a user-edited list which was assembled at run-time from a db query. So the choice was made to be rather crude about the reg-ex that the system constructed.
Scott

Ted Yu ted_yu-/E1597aS9LQAvxtiuMwx3w@public.gmane.org [seajug]
2014-05-14 22:15:13 UTC
Permalink
In my case, the log lines have different structures.
If I chain the reg-ex'es together, how do I tell which reg-ex is actually used ?

Cheers


kingn@u.washington.edu [seajug]
2014-05-14 22:25:40 UTC
Permalink
Most of the answer you can find in tutorials.

You probably want to decide whether it's faster to try 10 separate patterns on
one line or use a pattern that has 10 ORs of them all together. The first
will be faster for long patterns, yes?


You might improve your performance by using java.nio.CharBuffer.
kingn@u.washington.edu [seajug]
2014-05-14 22:30:49 UTC
Permalink
example of using nio with regex:


'http://en.wikipedia.org/wiki/New_I/O'
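
One way NIO and regex can be combined is sketched below: java.nio.CharBuffer implements CharSequence, so it can be passed straight to Pattern.matcher. The class and method names are hypothetical, and the sketch assumes the log is ISO-8859-1 encoded and small enough to be mapped and decoded in one piece. Whether this is actually faster than reading line by line would need to be measured.

```java
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NioGrep {
    /** Counts the matches of the pattern over the whole file. */
    static int countMatches(Path file, Pattern pattern) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Map the file read-only, then decode the bytes to characters.
            MappedByteBuffer bytes = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            CharBuffer chars = StandardCharsets.ISO_8859_1.decode(bytes); // assumed charset
            Matcher m = pattern.matcher(chars);
            int count = 0;
            while (m.find()) {
                count++;
            }
            return count;
        }
    }
}
```

Since the matcher sees the whole file as one sequence, a pattern anchored with ^ and $ has to be compiled with Pattern.MULTILINE for the anchors to apply per line.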





Eric Jain eric.jain-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org [seajug]
2014-05-15 03:17:34 UTC
Permalink
Post by Ted Yu ted_yu-/***@public.gmane.org [seajug]
If I chain the reg-ex'es together, how do I tell which reg-ex is actually used ?
Just create a separate capturing group for each pattern, e.g.

"(?: |,|^)(?:(jr.?)|(sr.?)|(iii)|(iv))(?: |,|$)"
--
Eric Jain
Got data? Get answers at zenobase.com.
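
A minimal sketch of the one-group-per-pattern idea, using the expression quoted above with the case-insensitive flag. The class and method names are hypothetical. After a successful find(), exactly one of the four groups is non-null, and its number says which alternative matched.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhichGroup {
    static final Pattern SUFFIX = Pattern.compile(
        "(?: |,|^)(?:(jr.?)|(sr.?)|(iii)|(iv))(?: |,|$)", Pattern.CASE_INSENSITIVE);

    /** Returns the 1-based number of the group that matched, or 0 if nothing matched. */
    static int matchedGroup(String input) {
        Matcher m = SUFFIX.matcher(input);
        if (m.find()) {
            for (int i = 1; i <= m.groupCount(); i++) {
                if (m.group(i) != null) {
                    return i;
                }
            }
        }
        return 0;
    }
}
```

If the individual patterns contain capturing groups of their own, as the log patterns in the original question do, the group numbers shift accordingly. Named groups, written (?<name>...) and available since Java 7, are one way to keep the lookup readable in that case.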


Dennis Sosnoski dms-WAiJhE/vqclWk0Htik3J/w@public.gmane.org [seajug]
2014-05-15 04:21:00 UTC
Permalink
If the patterns don't have much in common I don't know how much you'll
gain by merging them into one.

But you can do the pattern matching in parallel across multiple threads.
Use an ExecutorService as a thread pool, submit the individual pattern
matches as tasks and then wait for the full set of returned Futures to
complete. If your current code is compute bound this should make it
several times faster (but if it's IO bound it won't help).
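
A rough sketch of the thread-pool approach, under the assumption that the lines are already in memory and that each task simply counts the lines its pattern matches. The class and method names are hypothetical. Pattern instances are safe to share between threads; each task creates its own Matcher objects.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.regex.Pattern;

class ParallelMatch {
    /** Counts, for each pattern, how many lines it matches; one task per pattern. */
    static List<Long> countPerPattern(List<Pattern> patterns, List<String> lines)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(
            Runtime.getRuntime().availableProcessors());
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Pattern p : patterns) {
                // Submit one Callable per pattern.
                futures.add(pool.submit(() -> {
                    long n = 0;
                    for (String line : lines) {
                        if (p.matcher(line).find()) {
                            n++;
                        }
                    }
                    return n;
                }));
            }
            // Wait for every task; results come back in pattern order.
            List<Long> counts = new ArrayList<>();
            for (Future<Long> f : futures) {
                counts.add(f.get());
            }
            return counts;
        } finally {
            pool.shutdown();
        }
    }
}
```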

If you have enough memory available you could also read the entire log
into memory and supply it as a collection of lines to the matching
threads. This would let you use the pattern matchers as filters in
multiple Java 8 streams of lines (all off the same backing collection).
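
As a small illustration of the streams variant, Pattern.asPredicate() (Java 8) turns a compiled pattern into a filter that keeps lines in which the pattern is found. The class and method names here are hypothetical; the method would be called once per pattern on the same in-memory list.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class StreamFilter {
    /** Returns the lines in which the pattern is found, in their original order. */
    static List<String> matching(List<String> lines, Pattern p) {
        return lines.stream()
                    .filter(p.asPredicate())
                    .collect(Collectors.toList());
    }
}
```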

- Dennis