This is a discussion on finding similar text in files in various subdirectories within the Slackware Linux Support forums, part of the Unix Operating Systems category.
I don't know where to ask this, so I'll ask it here. I hope nobody minds, as I have found the folks here very helpful before. Thanks for all your help.

To my current question: I have multiple subdirectories within a single directory. Each subdirectory contains multiple text files with varying names. How do I find files which are very closely similar to each other? Yes, the text files are C source code files.

Any simple and good script or a command line utility for this?

Thanks in advance.
A.
| |||
On Mon, 16 Jan 2006 05:38:56 -0800, anonymous wrote:

> I don't know where to ask this so I ask it here. I hope nobody minds as
> I have found the folks here very helpful before.

Really? That isn't usually the case with Windoze users posting from Google Groups...

> To my current question: I have multiple subdirectories within a single
> directory. Each subdirectory contains multiple text files with varying
> names. How do I find files which are very closely similar to each other.
> Yes, the text files are C source code files.

A Linux user might use 'grep' and/or 'diff'. Not sure what you can use.

man grep
man diff

--
If you're not on the edge, you're taking up too much space.
Linux Registered User #327951
| |||
On 2006-01-16, anonymous <call_ret@yahoo.com> wrote:
>
> To my current question: I have multiple subdirectories within a single
> directory. Each subdirectory contains multiple text files with varying
> names. How do I find files which are very
> closely similar to each other. Yes, the text files are C source code
> files.
>
> Any simple and good script or a command line utility for this?

The problem is that it's not a simple task. If you're looking for files with contents similar to other files, or strings in one file that are similar to strings in another file... good luck. "if a==c" is the same as "if c==a": same thing, different strings.

If you're looking for identical functions across various C files, ctags or etags would probably help.

Or... I'm not sure, but you could try something like the swish-e indexing program, which uses an agrep function that will find similar or fuzzy matches. Again, that's assuming I remember correctly.

ken
| |||
On 2006-01-16, No_One <no_one@no_where.com> wrote:
> On 2006-01-16, anonymous <call_ret@yahoo.com> wrote:
>
> Or....I'm not sure, but you could try something like swish-e indexing program
> that uses an agrep function which will find similar matches or fuzzy matches,
> again that's assuming I remember correctly.

A follow-up to my own post: swish doesn't use the agrep algorithm; it's glimpse that uses the agrep algorithm.

ken
| |||
anonymous wrote:
>
> To my current question: I have multiple subdirectories within a single
> directory. Each subdirectory contains multiple text files with varying
> names. How do I find files which are very
> closely similar to each other. Yes, the text files are C source code
> files.
>

This reminds me of a problem computer science professors had in trying to determine whether students were cheating by turning in copied programs (sorry, it was long ago and I don't have references). Students were changing variable names, comments, and the order of statements to try to hide the cheating. Possibly the professors had a C parser and could determine similarities. If I had the same problem and didn't have a problem, I might create metrics of (1) count the number of statements and subroutines in a file, (2) count of number of different type (like how many if, for, etc. statements) for each subroutine, (3) call tree structure of each file. I'm sure you can think of more metrics. If two files had the same or very similar metrics, I would examine them more closely.

If detecting cheating isn't your problem, perhaps creating some sort of metrics might help.

Hope that helps
klee12
| |||
Oops, several typos. The third sentence in the main paragraph should read something like:

If I had the same problem and didn't have a **parser**, I might create metrics **like** (1) count the number of statements and subroutines in a file, (2) count the number of different **types of statements** (like how many if, for, etc. statements) for each subroutine, (3) call tree structure of each file.

Sorry
klee12
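A rough sketch of the metrics idea, written in Python and assuming no C parser is available. It only covers a simplified form of metrics (1) and (2): it counts semicolons as a stand-in for statements and counts a handful of control-flow keywords per file (not per subroutine), and it skips the call tree entirely. The names `metrics` and `metric_distance`, the keyword list, and the distance measure are all illustrative choices, not anything from the posts above.

```python
import re
from collections import Counter

# Keywords to count; an arbitrary, illustrative selection.
KEYWORDS = ("if", "else", "for", "while", "do", "switch", "case", "return", "goto")

# Comments, string literals and character literals, so that their contents
# are not mistaken for code.
_NOISE = re.compile(
    r"/\*.*?\*/"                   # /* block comment */
    r"|//[^\n]*"                   # // line comment
    r'|"(?:\\.|[^"\\\n])*"'        # "string literal"
    r"|'(?:\\.|[^'\\\n])*'",       # 'c' character literal
    re.DOTALL,
)


def metrics(source):
    """Return crude per-file metrics for a string of C source code."""
    code = _NOISE.sub(" ", source)
    counts = Counter(
        word for word in re.findall(r"[A-Za-z_]\w*", code) if word in KEYWORDS
    )
    # Very rough statement count: every semicolon, including the two
    # inside a for(;;) header.
    counts["statements"] = code.count(";")
    return counts


def metric_distance(m1, m2):
    """Sum of absolute differences over all metrics; 0 means identical metrics."""
    return sum(abs(m1[key] - m2[key]) for key in set(m1) | set(m2))
```

Two files whose `metric_distance` is zero or very small would then be candidates for a closer look by hand. Renaming variables or editing comments does not change these counts, which is the point of the approach; reordering or padding code can still defeat it.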
| |||
i guess a crude solution might be simply to compare each file with every other, and send the output of diff to wc, logging the results and grouping files with diff|wc below a given threshold. if "very closely similar" means slightly edited versions of the same original file, this would probably do the trick. i'm sure a bash script could do all this (but i'm not going to write it!).

Message posted via:
=====================
www.linuxpackages.net/forum
www.linuxpackages.net
Expanding the world of Slackware
=====================
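For what it's worth, the pairwise idea above can be sketched in a few lines. This version is in Python and uses the standard `difflib` module in place of the external diff and wc commands, so the counts will not match `diff | wc -l` exactly. The function names, the `*.c` pattern, and the default threshold of 10 are assumptions made for illustration.

```python
import difflib
import itertools
from pathlib import Path


def changed_line_count(a_lines, b_lines):
    """Count lines that would have to be removed from a or added from b."""
    matcher = difflib.SequenceMatcher(None, a_lines, b_lines, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            changed += (i2 - i1) + (j2 - j1)
    return changed


def similar_pairs(root, threshold=10, pattern="*.c"):
    """Compare every file under root with every other file.

    Returns a sorted list of (changed_lines, file_a, file_b) for the pairs
    whose changed-line count is below the threshold.
    """
    files = sorted(Path(root).rglob(pattern))
    contents = {f: f.read_text(errors="replace").splitlines() for f in files}
    pairs = []
    for a, b in itertools.combinations(files, 2):
        count = changed_line_count(contents[a], contents[b])
        if count < threshold:
            pairs.append((count, a, b))
    return sorted(pairs)
```

Calling `similar_pairs("/path/to/top/directory")` walks all subdirectories. The number of comparisons grows with the square of the number of files, so this is only practical for modest collections, and as noted above it will only catch slightly edited copies of the same file.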
| ||||
anonymous wrote:
> Any simple and good script or a command line utility for this?

I was out surfing for something else when I stumbled across a utility named 'finddouble 1.4'; maybe it can be used:

<URL: http://bsegonnes.free.fr/en_projects.html>

The program comes both as source and as a bash script, so there's something to start with.

--
Thomas O.

This area is designed to become quite warm during normal operation.
|
|