Unix Technical Forum

finding similar text in files in various subdirectories

  #1 (permalink)  
Old 02-20-2008, 02:03 PM
anonymous
 
Posts: n/a
Default finding similar text in files in various subdirectories

I don't know where to ask this, so I'm asking here. I hope nobody minds, as
I have found the folks here very helpful before. Thanks for all your
help.

To my current question: I have multiple subdirectories within a single
directory. Each subdirectory contains multiple text files with varying
names. How do I find files which are very closely similar to each other?
Yes, the text files are C source code files.


Any simple and good script or a command line utility for this?

Thanks in advance.

A.

  #2 (permalink)  
Old 02-20-2008, 02:03 PM
Dan C
 
Posts: n/a
Default Re: finding similar text in files in various subdirectories

On Mon, 16 Jan 2006 05:38:56 -0800, anonymous wrote:

> I don't know where to ask this so I ask it here. I hope nobody minds as
> I have found the folks here very helpful before.


Really? That isn't usually the case with Windoze users posting from
Google Groups...

> To my current question: I have multiple subdirectories within a single
> directory. Each subdirectory contains multiple text files with varying
> names. How do I find files which are very closely similar to each other?
> Yes, the text files are C source code files.


A Linux user might use 'grep' and/or 'diff'. Not sure what you can use.

man grep
man diff

--
If you're not on the edge, you're taking up too much space.
Linux Registered User #327951

  #3 (permalink)  
Old 02-20-2008, 02:03 PM
No_One
 
Posts: n/a
Default Re: finding similar text in files in various subdirectories

On 2006-01-16, anonymous <call_ret@yahoo.com> wrote:
>
> To my current question: I have multiple subdirectories within a single
> directory. Each subdirectory contains multiple text files with varying
> names. How do I find files which are very
> closely similar to each other? Yes, the text files are C source code
> files.
>
>
> Any simple and good script or a command line utility for this?


The problem is that it's not a simple task. If you're looking for files with
contents similar to other files, or strings in one file that are similar to
strings in another file... good luck. if a==c is the same as if c==a:
same thing, different strings.
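
To make the point above concrete, here is a minimal Python sketch. Python and its difflib module are my own choice for illustration, not something suggested in the thread, and the sample strings and the name similarity are made up for the example. An exact comparison treats the two conditions as different, while a character-level similarity ratio rates them as close but not identical.

```python
import difflib

def similarity(x, y):
    # Ratio in [0, 1]; 1.0 means the two strings are identical.
    return difflib.SequenceMatcher(None, x, y).ratio()

# Hypothetical sample lines: same meaning, operands swapped.
a = "if (a == c) return 1;"
b = "if (c == a) return 1;"

print(a == b)            # exact string comparison
print(similarity(a, b))  # fuzzy comparison, high but below 1.0
```

A text-level score like this knows nothing about C, so it cannot tell that the two conditions are equivalent; it only says the strings are nearly the same.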

If you're looking for identical functions across various C files, ctags or
etags would probably help.

Or... I'm not sure, but you could try something like the swish-e indexing
program, which uses an agrep function that finds similar or fuzzy matches,
again assuming I remember correctly.

ken

  #4 (permalink)  
Old 02-20-2008, 02:04 PM
No_One
 
Posts: n/a
Default Re: finding similar text in files in various subdirectories

On 2006-01-16, No_One <no_one@no_where.com> wrote:
> On 2006-01-16, anonymous <call_ret@yahoo.com> wrote:
>
> Or....I'm not sure, but you could try something like swish-e indexing program
> that uses an agrep function which will find similar matches or fuzzy matches,
> again that's assuming I remember correctly.


A follow-up to my own post: swish-e doesn't use the agrep algorithm; it's
glimpse that uses the agrep algorithm.

ken
  #5 (permalink)  
Old 02-20-2008, 02:07 PM
klee12
 
Posts: n/a
Default Re: finding similar text in files in various subdirectories

anonymous wrote:
>
> To my current question: I have multiple subdirectories within a single
> directory. Each subdirectory contains multiple text files with varying
> names. How do I find files which are very
> closely similar to each other? Yes, the text files are C source code
> files.
>


This reminds me of a problem computer science professors had in
trying to determine whether students were cheating by turning in copied
programs (sorry, it was long ago and I don't have references). Students
were changing variable names, comments, and the order of statements to try
to hide the cheating. Possibly the professors had a C parser and could
determine similarities. If I had the same problem and didn't have a
problem, I might create metrics of (1) count the number of
statements and subroutines in a file in a file (2) count of number of
different type (like how many if, for, etc. statements) for each
subroutine, (3) call tree structure of each file. I'm sure you can
think of more metrics. If two files had the same or very similar
metrics I would examine them more closely.

If detecting cheating isn't your problem, perhaps creating some sort of
metrics might help.
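
As a rough illustration of the metrics idea, here is a minimal Python sketch. It is not a parser: it assumes that counting a handful of keywords and semicolons is a good enough proxy for statement counts, it ignores string literals, and it counts per file rather than per subroutine. The names KEYWORDS, strip_comments, metrics and distance, and the two sample sources, are made up for the example.

```python
import re
from collections import Counter

# Assumed set of keywords to count; extend as needed.
KEYWORDS = ("if", "for", "while", "switch", "return")

def strip_comments(src):
    # Crudely remove /* ... */ and // ... comments (string literals are not handled).
    src = re.sub(r"/\*.*?\*/", " ", src, flags=re.DOTALL)
    return re.sub(r"//[^\n]*", " ", src)

def metrics(src):
    code = strip_comments(src)
    words = Counter(re.findall(r"[A-Za-z_]\w*", code))
    m = {kw: words[kw] for kw in KEYWORDS}
    # Semicolons as a rough stand-in for the number of statements.
    m["statements"] = code.count(";")
    return m

def distance(m1, m2):
    # Sum of absolute differences; 0 means identical metrics.
    return sum(abs(m1[k] - m2[k]) for k in m1)

# Two hypothetical files: the second only renames identifiers and changes comments.
src1 = "int f(int a){ /* sum */ int s=0; for(int i=0;i<a;i++){ if(i%2) s+=i; } return s; }"
src2 = "int g(int n){ int total=0; for(int k=0;k<n;k++){ if(k%2) total+=k; } return total; } // renamed"

print(metrics(src1))
print(distance(metrics(src1), metrics(src2)))
```

Files whose distance is small would be the ones to examine more closely by hand. Renaming variables and editing comments does not change these counts, which is the attraction of the approach.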

Hope that helps

klee12

  #6 (permalink)  
Old 02-20-2008, 02:07 PM
klee12
 
Posts: n/a
Default Re: finding similar text in files in various subdirectories

Oops, several typos. The 3rd sentence in the main paragraph should read
something like:

If I had the same problem and didn't have a
**parser** I might create metrics **like** (1) count the number of
statements and subroutines in a file (2) count of number of
different ** types of statements ** (like how many if, for, etc.
statements) for each
subroutine, (3) call tree structure of each file.

Sorry

klee12

  #7 (permalink)  
Old 02-20-2008, 02:07 PM
doc
 
Posts: n/a
Default Re: finding similar text in files in various subdirectories

i guess a crude solution might be simply to compare each file with
every other, and send the output of diff to wc, logging the results
and grouping files with diff|wc below a given threshold. if
"very closely similar" means slightly edited versions of
the same original file, this would probably do the trick.

i'm sure a bash script could do all this (but i'm not going to write
it!).
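
For what it's worth, the same idea can be sketched in a few lines of Python instead of bash. This is only a sketch under some assumptions: it uses Python's difflib in place of the real diff and wc, so the counts will not match diff|wc exactly; it only looks at files matching *.c; and the default threshold of 10 changed lines is an arbitrary number picked for the example. The function names changed_lines and similar_pairs are made up.

```python
import difflib
import itertools
from pathlib import Path

def changed_lines(a_lines, b_lines):
    # Count lines that differ between the two files (removed plus added),
    # roughly the quantity that piping diff into wc would measure.
    sm = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed

def similar_pairs(root, threshold=10, pattern="*.c"):
    # Walk all subdirectories of root and compare every file with every other.
    files = sorted(Path(root).rglob(pattern))
    texts = {f: f.read_text(errors="replace").splitlines() for f in files}
    pairs = []
    for a, b in itertools.combinations(files, 2):
        n = changed_lines(texts[a], texts[b])
        if n <= threshold:
            pairs.append((n, a, b))
    return sorted(pairs)  # most similar pairs first
```

Calling similar_pairs(".") from the top directory and printing each (count, file, file) tuple would list the candidate pairs. Comparing every file with every other is quadratic in the number of files, so this is only practical for a modest number of files.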

  #8 (permalink)  
Old 02-20-2008, 02:10 PM
Thomas Overgaard
 
Posts: n/a
Default Re: finding similar text in files in various subdirectories


anonymous wrote :

> Any simple and good script or a command line utility for this?


I was out surfing for something else when I stumbled across this
utility named 'finddouble 1.4', maybe it can be used:
<URL: http://bsegonnes.free.fr/en_projects.html>

The program comes both as source and as a bash script so there's
something to start with.
--
Thomas O.

This area is designed to become quite warm during normal operation.
All times are GMT.


www.UnixAdminTalk.com