Topic: Concatenate 400 files (Page 1 of 1)

 
SleepingWolf
Paranoid (IV) Inmate

From:
Insane since: Jul 2006

posted 11-03-2007 19:56

I downloaded 409 html files using an offline browser.
I would now like to concatenate all 400 files into 1 file, stripping away all the html and other tags in the process.
Is there one utility that can do that?
If not, is there a good utility to concatenate the html files and another one to convert them all into 1 giant text file?
Thanks

I'll keep googling.

Nature & Travel Photography
Visit the Sleeping Wolves

edit: OK, I found 2 freeware utilities that did the conversion and concat very nicely.
Now, does anyone know a good freeware tool to remove extra whitespace/blank lines?


(Edited by SleepingWolf on 11-03-2007 21:22)

poi
Paranoid (IV) Inmate

From: Norway
Insane since: Jun 2002

posted 11-03-2007 21:46

You could make a macro in your text editor of choice.

zavaboy
Paranoid (IV) Inmate

From: f(x)
Insane since: Jun 2004

posted 11-03-2007 22:34

I'm guessing this is in Windows... In the command line, make sure you are in the folder where the files are, and then type:

code:
type *.html > everything.html


Then use a text editor with a regular expression or wildcard search and replace to remove all the HTML tags. Or you can just view it in your browser, select all, copy, and paste into a text editor.
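
If you would rather script both steps in one go, here is a minimal Python sketch of the same idea: read every .html file in a folder, keep only the text, and write it all to one file. The names `TextExtractor`, `html_to_text`, `concat_as_text` and the output name `everything.txt` are made up for illustration, and it assumes the files are UTF-8 (undecodable bytes are replaced rather than raising an error). It is not what the utilities mentioned in this thread do internally.

```python
from html.parser import HTMLParser
from pathlib import Path


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text(html):
    """Return the text content of an HTML string with all tags removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


def concat_as_text(folder, out_name="everything.txt"):
    """Hypothetical helper: join every *.html in `folder` into one text file."""
    folder = Path(folder)
    files = sorted(folder.glob("*.html"))
    with open(folder / out_name, "w", encoding="utf-8") as out:
        for path in files:
            html = path.read_text(encoding="utf-8", errors="replace")
            out.write(html_to_text(html))
            out.write("\n")
    # Number of files that were joined
    return len(files)
```

Calling `concat_as_text(".")` from the folder with the downloads should leave one `everything.txt` next to them. The output uses a .txt extension on purpose, so a second run does not pick up its own output.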

Hope that helps!

SleepingWolf
Paranoid (IV) Inmate

From:
Insane since: Jul 2006

posted 11-03-2007 23:25

I've done the conversion to text, and the concatenation to 1 big text file.
I now need to strip out excess blank lines. I found a utility called tab.exe, but it doesn't work properly with this file.
I also wrote a VBA program in Excel, but for some reason Excel adds "quotes" to some lines, seemingly at random, when I save to text. The problem is that I have no way to know which lines need the quotes and which ones don't, so I can't search and replace: the file has over 30,000 lines, including the extra white space. I also needed the Excel program to sort the text, using a divider that marks the start and end of each poem (it's a huge collection of my mom's poems).
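
For the divider-and-sort step, a small script avoids the round trip through Excel. This is only a sketch under assumptions the post does not spell out: the divider is taken to be a line containing exactly `=====`, and poems are sorted by their first line. Both, along with the function names `split_poems` and `sort_poems`, are hypothetical and would need adjusting to the real file.

```python
def split_poems(text, divider="====="):
    """Split text into poems at lines that consist only of the divider."""
    poems, current = [], []
    for line in text.splitlines():
        if line.strip() == divider:
            # Close the current poem, ignoring chunks that are entirely blank
            if any(l.strip() for l in current):
                poems.append("\n".join(current).strip("\n"))
            current = []
        else:
            current.append(line)
    # Keep a final poem that is not followed by a divider
    if any(l.strip() for l in current):
        poems.append("\n".join(current).strip("\n"))
    return poems


def sort_poems(poems):
    """Sort poems by their first line, ignoring case (assumed to be the title)."""
    return sorted(poems, key=lambda p: p.splitlines()[0].strip().lower())
```

Writing the result back with `("\n=====\n").join(sort_poems(split_poems(text)))` keeps the same divider between poems, and plain text never picks up extra quotes.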


(Edited by SleepingWolf on 11-03-2007 23:28)

zavaboy
Paranoid (IV) Inmate

From: f(x)
Insane since: Jun 2004

posted 11-03-2007 23:39

I think this should work. CMD:

code:
findstr /r /v "^$" oldfile.txt > newfile.txt
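
In case the command line is not an option, the same filter is a few lines of Python. This is a minimal sketch, not a tested tool: `drop_blank_lines` and `strip_file` are made-up names, the file names are just the ones from the command above, and UTF-8 is assumed. Unlike a plain empty-line match, it also drops lines that contain only spaces or tabs.

```python
from pathlib import Path


def drop_blank_lines(text):
    """Return text with empty and whitespace-only lines removed."""
    kept = [line for line in text.splitlines() if line.strip()]
    return "\n".join(kept) + ("\n" if kept else "")


def strip_file(src="oldfile.txt", dst="newfile.txt"):
    """Hypothetical wrapper: read src, drop blank lines, write dst."""
    text = Path(src).read_text(encoding="utf-8", errors="replace")
    Path(dst).write_text(drop_blank_lines(text), encoding="utf-8")
```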





(Edited by zavaboy on 11-03-2007 23:39)

(Edited by zavaboy on 11-03-2007 23:41)

reisio
Paranoid (IV) Inmate

From: Florida
Insane since: Mar 2005

posted 11-04-2007 01:12

/me cringes, then hugs his Unix commandline

SleepingWolf
Paranoid (IV) Inmate

From:
Insane since: Jul 2006

posted 11-05-2007 04:45

I tried using Textpad with regular expressions (\n) and MS Word, searching for ^p^p.
Both worked, but with some quirks, so I ended up writing a program in VBA.
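
For reference, the search-and-replace idea can be done in one pass with a regular expression. The sketch below is in Python rather than VBA and is not the program mentioned above; `collapse_blank_lines` and its `keep` parameter are hypothetical names. It squeezes each run of blank lines down to a fixed number instead of deleting them all, which may matter for poems if stanza breaks should survive.

```python
import re


def collapse_blank_lines(text, keep=1):
    """Squeeze every run of blank lines down to `keep` blank lines."""
    # Normalise Windows line endings so a single pattern covers both cases
    text = text.replace("\r\n", "\n")
    # A newline followed by one or more lines holding only spaces or tabs
    pattern = r"\n(?:[ \t]*\n)+"
    return re.sub(pattern, "\n" * (keep + 1), text)
```

With `keep=1`, three blank lines in a row become one; with `keep=0`, the blank lines between lines of text disappear entirely.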

For the record, these 2 freeware utilities were excellent:

Simple File Joiner (look for the freeware version)

HtmlAsText.exe - did a clean job of removing scripts as well.

HTTrack was used for the offline browsing.


lallous
Maniac (V) Inmate

From: Lebanon
Insane since: May 2001

posted 11-05-2007 09:29

cool post, thanks for sharing.

--
Regards,
Elias


