File search using regex
Michael Cohen
michael.cohen at netspeed.com.au
Wed Apr 18 23:45:17 CST 2007
OK, so I've got to chip in:
#!/usr/bin/python
import sys

patterns = ["!D2\nInvoice\n!C\nAUSTALIA EIGHT",
            "!D2\r\nInvoice\r\n!C\r\nAUSTALIA EIGHT"]
BUFFER_SIZE = 1024*1024
MAX_WINDOW_SIZE = 100

def search_file(filename):
    fd = open(filename)
    data = ''
    while 1:
        new_data = fd.read(BUFFER_SIZE)
        if not new_data: break
        ## This is a sliding window: keep the tail of the previous
        ## buffer so a pattern split across two reads is still found.
        data = data[-MAX_WINDOW_SIZE:] + new_data
        for pattern in patterns:
            if pattern in data:
                print("Pattern in file %s" % filename)
                return

for filename in sys.argv[1:]:
    search_file(filename)
This does a sliding window and will match even when the pattern is split across
buffers. Also I prefer to use a single shot tool because it can be more
flexibly combined with shell like:
finder.py *.reports
or: find reports/ -atime -10 | xargs finder.py
etc
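To see why the window matters, here is a minimal sketch that runs the same idea over an in-memory stream with a deliberately tiny buffer. The name stream_contains and its parameters are made up for this illustration; carrying over one character less than the longest pattern is assumed to be enough, since a match split across two reads can leave at most that many characters in the old buffer.

```python
import io

def stream_contains(stream, patterns, buffer_size=8, window=None):
    """Return True if any pattern occurs in the stream, reading in chunks."""
    # Hypothetical default: keep len(longest pattern) - 1 characters,
    # the most a split match can leave behind in the previous buffer.
    if window is None:
        window = max(len(p) for p in patterns) - 1
    data = ''
    while True:
        new_data = stream.read(buffer_size)
        if not new_data:
            return False
        # data[-0:] would keep everything, so handle window == 0 explicitly.
        tail = data[-window:] if window > 0 else ''
        data = tail + new_data
        if any(p in data for p in patterns):
            return True

text = "header\n!D2\nInvoice\n!C\ntrailer\n"
pattern = "!D2\nInvoice\n!C"

# With the window, the match split across 8-character reads is found.
print(stream_contains(io.StringIO(text), [pattern]))            # True
# Without it, no single 8-character chunk can hold the 14-character pattern.
print(stream_contains(io.StringIO(text), [pattern], window=0))  # False
```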
Depending on how accurate the files are, you may need to use regular
expressions (the re module) instead. The above example handles both Unix and
DOS line endings.
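If regular expressions are needed, one possible shape is sketched below. This is only an illustration under assumptions: HEADER_RE and stream_matches are hypothetical names, and the expression simply restates the header with \r?\n so that a single pattern accepts either line ending. The buffer and window sizes mirror the script above.

```python
import io
import re

# Hypothetical regex form of the header; \r?\n accepts Unix or DOS endings.
HEADER_RE = re.compile(r"!D2\r?\nInvoice\r?\n!C\r?\nAUSTALIA EIGHT")

def stream_matches(stream, regex=HEADER_RE, buffer_size=1024 * 1024, window=100):
    """Return True if regex matches anywhere in the stream."""
    data = ''
    while True:
        new_data = stream.read(buffer_size)
        if not new_data:
            return False
        # Same sliding window as before: keep the tail of the old buffer.
        data = data[-window:] + new_data
        if regex.search(data):
            return True

unix_text = "junk\n!D2\nInvoice\n!C\nAUSTALIA EIGHT\nmore\n"
dos_text = "junk\r\n!D2\r\nInvoice\r\n!C\r\nAUSTALIA EIGHT\r\nmore\r\n"
print(stream_matches(io.StringIO(unix_text), buffer_size=5))
print(stream_matches(io.StringIO(dos_text), buffer_size=5))
```

The window must be at least as long as the longest text the expression can match, so an open-ended expression (one using .* for instance) would need a different strategy.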
Python is also IMHO much more readable than the perl example above (Duck..).
Especially a syntax like:
if pattern in data:...
is better than
if ($data =~ /pattern/) {}
if you have never seen either of these languages before.
Michael.
On Wed, Apr 18, 2007 at 11:02:14PM +0930, Tim Wegener wrote:
> Hi Chris,
>
> On Wed, 2007-04-18 at 17:36 +0900, Chris Organ wrote:
> > Wondering if someone can assist with regex searches using perl / grep / anything command line. I have lost a severe amount of hair looking for a multi-line search based on exact pattern matching. What I need to do is find this exact header within files within a directory and list them, similar output to using grep -l ie a list.
>
> It looks like someone beat me to it, but since I've done it now, here is
> an alternative. (See attachment, with usage explained in the docstring.)
>
> Tim
>
> #!/usr/bin/env python
> """grep -l for a multi-line pattern.
>
> Usage:
> multi_grep.py <multi-line-pattern> filenames
>
> This will print out a list of filenames that matched the pattern.
> It will return 0 if there were any matches, 1 if none.
>
> Example:
> $ multi_grep.py '!D2\nInvoice' matching.txt nonmatching.txt
> matching.txt
>
> Note that \n is used to signify newlines.
> Each chunk that is not a newline must match the whole line to match.
>
> """
>
> __author__ = 'Tim Wegener <twegener at fastmail.fm>'
>
> import re
> import sys
>
> def main():
>
>     pattern = sys.argv[1]
>     filenames = sys.argv[2:]
>
>     # Look for backslash+n so that it can be entered from the command line.
>     parts = pattern.split(r'\n')
>
>     if not parts:
>         sys.exit(2)
>
>     found_any = False
>
>     for filename in filenames:
>         # Reset the match state for each file so that a partial match at
>         # the end of one file cannot carry over into the next.
>         part_num = 0
>         in_match = False
>         result = False
>         for line in open(filename, 'r'):
>             match = re.match('^'+parts[part_num]+'$', line)
>             if match:
>                 part_num += 1
>                 if part_num == len(parts):
>                     result = True
>                     break
>                 in_match = True
>             elif in_match:
>                 in_match = False
>                 part_num = 0
>         if result:
>             found_any = True
>             print(filename)
>
>     sys.exit(int(not found_any))
>
>
> if __name__ == '__main__':
>     main()