LinuxSA Mailing list archives

Index: [thread] [date] [subject] [author] [stats]
  From: Andrew Hill <list@fornax.net>
  To  : Dan Kortschak <dan.kortschak@adelaide.edu.au>
  Date: Tue, 20 Nov 2001 15:14:25 +1030

Re: regular expression help

Hmmmm, regular expressions and DNA. I know that I'm going to regret this 
... :-)

Dan Kortschak wrote:

> My first question is how do I ask the RE engine to
> return the longest $1 match (I remember seeing this in a manual somewhere, but I
> can't find it again.


It would depend on the RE engine. Perl? Perl's quantifiers are greedy by 
default, so each one will *try* to grab as much as it can. However, the 
engine goes for leftmost *before* longest, and as Perl is a Traditional 
NFA engine it stops at the first match it finds (alternatives are tried 
left to right), so you may not always get the longest match. It's a 
regex engine issue.

See http://www.fornax.net/regex2/index.html for my lecture to LinuxSA 
that covered matching and longest/leftmost issues. It might help.
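
A quick sketch of the leftmost-before-longest behaviour, in case it helps. The strings and the sub name first_match are made up for illustration:

```perl
use strict;
use warnings;

# Return the text matched by $re in $str, or undef when there is no match.
sub first_match {
    my ($str, $re) = @_;
    return $str =~ /($re)/ ? $1 : undef;
}

my @cases = (
    # Leftmost beats longest: "ab" at offset 0 wins over the later "abbb".
    [ "ab abbb", qr/ab+/ ],
    # Alternatives are tried left to right, so "foo" wins over "foobar".
    [ "foobar",  qr/foo|foobar/ ],
    # From a given start, a greedy quantifier does take all it can: "abbb".
    [ "abbb ab", qr/ab+/ ],
);

for my $case (@cases) {
    my ($str, $re) = @$case;
    my $hit = first_match($str, $re);
    print "$str: ", defined $hit ? $hit : "no match", "\n";
}
```

If you need the longest alternative, putting the longer one first (foobar|foo) is the usual workaround on an engine like this.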

 
> The second question is how do I specify a string of triplets /(?:\S\S\S)*/


Hmmm, your terminology here is a bit odd. This is a regex that matches 
a group of 3 non-whitespace characters, zero or more times, without 
storing the matching characters in a variable. (Okay, I guess you can 
call that a string of triplets, but to me, a string of triplets is a 
string that has groups of threes in it, like "TTTAAAGGG".)
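
To see what that pattern actually grabs, here is a small sketch. The sub name triplet_prefix is made up, I have anchored the pattern at the start of the string, and I have added capturing parentheses around the whole thing only so the result can be printed:

```perl
use strict;
use warnings;

# Return the leading run of whole triplets in $str (possibly empty).
sub triplet_prefix {
    my ($str) = @_;
    my ($prefix) = $str =~ /^((?:\S\S\S)*)/;
    return $prefix;
}

# Whole triplets only: leftover characters and anything after
# whitespace are simply not part of the match.
for my $str ("TTTAAAGGG", "TTTAAAGG", "TT", "TTT AAA") {
    printf "%-10s -> '%s'\n", $str, triplet_prefix($str);
}
```

Note that "TT" still matches, with zero triplets, because * is happy with nothing at all.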

> that
> does not contain /(?:TAA|TAG|TGA)/?


You can use negative lookahead for the NOT part. This will search to 
make sure the next characters do NOT match what you have asked for, but 
without "consuming" characters for the matching. But you knew that... 
It's the next bit that's tricky, right?
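
For the record, a minimal sketch of the "without consuming" part. The sub name leading_non_stop is made up for illustration:

```perl
use strict;
use warnings;

# Return the first triplet of $seq unless it is TAA, TAG or TGA.
# The lookahead only peeks; \S\S\S then matches the very same characters.
sub leading_non_stop {
    my ($seq) = @_;
    return $seq =~ /^(?!TAA|TAG|TGA)(\S\S\S)/ ? $1 : undef;
}

for my $seq ("AAATAG", "TAGAAA") {
    my $hit = leading_non_stop($seq);
    print "$seq: ", defined $hit ? $hit : "no match", "\n";
}
```

The first string gives AAA and the second gives no match, because the lookahead rejects TAG before anything is captured.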

> If it were a single character, to avoid it
> would be easy, but I can't see a way to get this with frame-wise triplets of
> characters?


How about:

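$seq = "TAG" . "AAA" . "TTT" . "GGG" . "TAG" . "AAA";
# Anchor at the start to stay in frame, lazily skip leading triplets,
# then capture triplets one at a time, checking each against the stop
# codons so the capture cannot run past an in-frame TAA, TAG or TGA.
if ($seq =~ 
m/^(?:\S\S\S)*?((?:(?!TAA|TAG|TGA)\S\S\S)+)(?=TAA|TAG|TGA)/) {
   print "$1\n";   # prints AAATTTGGG
}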

Hmmm, this is not perfect, as it assumes that there will be a TAA, TAG 
or TGA sequence somewhere down the sequence from your searching point. 
But you get the idea...
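
One way to relax that assumption, as a sketch rather than something tested on real data: let the end of the string stand in for a stop codon. The sub name open_run is made up, the sequence is assumed to be chomped and free of whitespace, and this version is anchored at the start to stay in frame and checks every triplet against the stop list as it goes:

```perl
use strict;
use warnings;

# Return the first in-frame run of non-stop triplets, ended either by a
# stop codon or by the end of the sequence; undef if there is no such run.
sub open_run {
    my ($seq) = @_;
    return $seq =~ m/^(?:\S\S\S)*?                 # skip leading triplets
                     ((?:(?!TAA|TAG|TGA)\S\S\S)+)   # non-stop triplets
                     (?=TAA|TAG|TGA|\S{0,2}\z)      # stop codon or the end
                    /x ? $1 : undef;
}

for my $seq ("TAGAAATTTGGGTAGAAA", "TAGAAATTTGGG", "TAGTAA") {
    my $run = open_run($seq);
    print "$seq: ", defined $run ? $run : "no match", "\n";
}
```

The \S{0,2}\z alternative also tolerates one or two leftover bases at the end, which may or may not be what you want.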

(Start off by matching zero instances of 3 non-whitespace characters, 
then make sure the next three characters are not TAA, TAG or TGA. If 
they are, backtrack and match one instance of 3 non-whitespace 
characters, and so on until you find a starting point that isn't one of 
these 3 base sequences. Then match lots of 3 base sequences up to the 
point where the next 3 base sequence would be one of the forbidden ones.)
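
If the regex gets too hairy, the same steps can be spelled out as a plain loop. This is only a sketch: the names %is_stop and first_open_run are made up, and the sequence is assumed to contain no whitespace:

```perl
use strict;
use warnings;

my %is_stop = map { $_ => 1 } qw(TAA TAG TGA);

# Return the first in-frame run of non-stop codons that is followed by a
# stop codon, or nothing if there is no such run.
sub first_open_run {
    my ($seq) = @_;
    my @codons = $seq =~ /(\S\S\S)/g;      # cut into in-frame triplets
    my @run;
    for my $codon (@codons) {
        if ($is_stop{$codon}) {
            return join "", @run if @run;  # run ended by a stop codon
            next;                          # still skipping leading stops
        }
        push @run, $codon;
    }
    return;                                # ran out before a stop codon
}

my $run = first_open_run("TAGAAATTTGGGTAGAAA");
print defined $run ? $run : "no match", "\n";
```

It is more typing than the regex, but each step is easy to check by eye.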



HTH,

-- 
Andrew Hill
"RAID - Don't believe the hype." -- 2001-09-22

-- 
LinuxSA WWW: http://www.linuxsa.org.au/  IRC: #linuxsa on irc.linux.org.au
To unsubscribe from the LinuxSA list:
  mail linuxsa-request@linuxsa.org.au with "unsubscribe" as the subject

