File search using regex

Mark Newton newton at atdot.dotat.org
Thu Apr 19 11:35:47 CST 2007


On Thu, Apr 19, 2007 at 10:35:16AM +1000, Michael Cohen wrote:
 
 >   dont mistake wordy for inefficient - we would rather type a few more
 >   characters in to make our code more readable - its probably just as
 >   efficient though.

You're hand-rolling all kinds of loops and buffer management stuff
in those python scripts.

My example uses perl's optimized-by-a-cast-of-thousands built-in
facilities to do the same thing. 

You might be able to specially engineer contrived test cases where
the other examples which have been posted could be as efficient as
my 6-liner, but in the real world I suspect we'd come out even at
worst, with mine blowing the doors off yours at best.

There's also the maintainability question of a program which
took more than 2 minutes to develop vs a program that's so short
that it'll probably never need to be debugged.

 >   Compare your example below and have someone that has never
 >   programmed in perl read it.

I have no intention of writing perl code which someone who has
never programmed in perl can read.

Just like I have no intention of writing English which someone 
who has never encountered the English language can read.

There should be an expectation that anyone programming in -any-
language has invested the minimal amount of effort required to
be familiar with that language's basic features and syntax.

 > I used to write heaps of perl, but i havent done any for a while now -
 > and could not remember exactly what $_, $/ mean

$_ is used in almost every perl program.  Larry described it
as the perl equivalent of "it".

In English we can say, "Pick up the red pencil and put it in
the bin."  In most programming languages, you'd have to say,
"Pick up the red pencil and put the red pencil in the bin."
In perl, $_ means "it", and is roughly equivalent to, "that
item of data I'm trying to work on at the moment."
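
A small sketch of the difference, assuming a made-up list of pencils
(@pencils and the strings in it are purely illustrative, not from any
script in this thread):

```perl
use strict;
use warnings;

# Hypothetical data, purely for illustration.
my @pencils = ("red pencil", "blue pencil");

# Explicit loop variable: the "put the red pencil in the bin" style.
my @binned_explicit;
for my $pencil (@pencils) {
    push @binned_explicit, uc($pencil);
}

# Implicit $_: the "put it in the bin" style.
# "for" aliases each element to $_, and uc with no argument works on $_.
my @binned_implicit;
for (@pencils) {
    push @binned_implicit, uc;
}

# The statement-modifier form of "for" sets $_ as well.
print "$_\n" for @binned_implicit;
```

Both loops build the same list; the second one just never has to
name the thing it is working on.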

$/ is the input record separator.  I'm setting it to the undefined
value (undef).  The <filehandle> syntax in scalar context causes a
read of one record from the filehandle.  Usually the record separator
is "\n", which means <filehandle> fetches one line.  By setting it
(that word again!) to undef, I'm causing <filehandle> to slurp
the entire file.
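
Roughly, the two behaviours look like this.  This is a sketch rather
than the script under discussion: it assumes a throwaway three-line
temp file so it can run on its own, and it uses lexical filehandles
and "local $/" where the original may well have used a bareword
handle and a plain undef.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a throwaway three-line file so the sketch is self-contained.
my ($out, $path) = tempfile(UNLINK => 1);
print {$out} "line one\nline two\nline three\n";
close $out or die "close failed: $!";

# Default $/ is "\n": one scalar-context read returns one line.
open my $fh, '<', $path or die "open failed: $!";
my $first_line = <$fh>;
close $fh;

# With $/ undefined, one read returns everything up to end of file.
my $whole_file;
{
    local $/;    # undefined inside this block, restored on exit
    open my $slurp, '<', $path or die "open failed: $!";
    $whole_file = <$slurp>;
    close $slurp;
}

printf "first read: %d bytes, slurp: %d bytes\n",
    length($first_line), length($whole_file);
```

Using "local" keeps the change from leaking into any other code that
still expects line-at-a-time reads.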

Both of these are documented in "perldoc perlvar", which is
one of those docs which every new perl programmer is (should be?)
introduced to in lesson number 1.

 > BTW does your code handle the sliding window problem?

It handles it by making it not exist :-)

 >   The following questions come to my mind when reading this (bear
 > in mind I forgot most of my perl):
 >   
 >   Does grep read lines or fixed sized buffers? (It must read buffers
 >   i suppose or you program would not work because the pattern is multi
 >   lined).   How  large are the buffers then?

grep doesn't read anything.  You give it a regex as its first arg,
and a list as its second arg.

It runs the regex against all the scalars in the list, and returns 
another list which contains the members which matched.

In this instance, the list I'm providing in the second arg is
<FH>.  grep evaluates it in list context, where it returns every
remaining record; with $/ set as described above, that's a list
with a single scalar element which contains an entire file.

The regex is anchored with ^, so what I'm effectively doing is
using grep as a boolean operator which returns TRUE if the file
begins with the required text.
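
To make that concrete, here's a minimal sketch.  The strings in
@files stand in for slurped file contents, and the "MAGIC header"
text is a made-up placeholder, not the pattern from the original
problem.

```perl
use strict;
use warnings;

# Hypothetical file contents held in scalars, standing in for slurped files.
my @files = (
    "MAGIC header\nrest of file one\n",
    "no header here\nMAGIC header appears later\n",
);

# grep returns the elements for which the regex matches.  Without /m,
# ^ matches only at the very start of the string, not after each newline.
my @matches = grep { /^MAGIC header\n/ } @files;

# In scalar context grep returns the number of matches, so with a
# one-element list it behaves like a boolean.
my $begins = grep { /^MAGIC/ } ($files[0]);
my $later  = grep { /^MAGIC/ } ($files[1]);

print "matching elements: ", scalar(@matches), "\n";
print "first begins with MAGIC: $begins, second: $later\n";
```

The second string contains the text, but not at the start, so it
doesn't count.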

"perldoc -f grep"

 >   If it reads lines what is the line size? If it comes across a
 > 10GB file with no \n what happens? 

Regardless of whether there are any \n's, if it comes across
a 10 Gbyte file it'll read it into a 10 Gbyte scalar.

One line of code later that scalar will be unreferenced, and 
will be picked up by perl's garbage collection.  Which, as pointed
out above, has been debugged and optimized by a cast of gazillions.

  - mark

--------------------------------------------------------------------
I tried an internal modem,                    newton at atdot.dotat.org
     but it hurt when I walked.                          Mark Newton
----- Voice: +61-4-1620-2223 ------------- Fax: +61-8-82231777 -----

