
Topic awaiting preservation: Searching pdf using php?

 
DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-20-2003 16:21

(Also at the GN)

Hi there all!
I'm building a little thing in php that lets the user create a file archive, add categories, set which filetypes are allowed for upload, and upload files with a small description of each. All info except the actual file is stored in mysql; the files themselves are in the filesystem.

Then, of course you should be able to search in the archives.

All of this is done, including searching inside an archive, with or without categories, plus free text search on the description in the database.

However, I want to be able to search inside the files as well...
I've got it working so I can search inside .txt, .html, .doc but I'm stumped trying to search inside a pdf-file.

I have searched quite a lot by now, and while this question comes up all over the net, every solution points to using "xpdf", "ghostscript" and other "external" things.

I'd really like this to be as independent of the server-installed things as possible, meaning pure php.

Here is the code that searches inside all files except pdf:
(For pdf files it uses a class I found at http://www.phpclasses.org/browse.html/package/702.html during my searches, but that part doesn't work...)

code:
function searchInFile($selArch, $searchText) {
    $baseDir = "archives/";
    if (($selArch != "") && ($searchText != "")) {
        print("<strong>Searching files in the whole archive...</strong><br>");
        $archiveName = getArchiveName($selArch);
        $path = $baseDir . $archiveName . "/";
        $dir = opendir($path);
        $count = 0;
        $hits = 0;
        // Strict comparison, so a file named "0" does not end the loop early
        while (($file = readdir($dir)) !== false) {
            if ($file != "." && $file != "..") {
                $ext = substr($file, -4);
                //print($ext."<br>");
                if ($ext == ".pdf") {
                    print("<br><strong>Searching in pdf...</strong><br>");
                    $content = implode('', file($path . $file));
                    // Allocate class instance
                    $pdf = new pdf_search($content);
                    // And do the search
                    if ($pdf->textfound($searchText)) {
                        //echo "We found $searchText in $path$file<br>";
                        $hits++;
                        print("<a href=\"" . $path . $file . "\" target=\"_blank\">" . $file . "</a><br />The word <i>" . $searchText . " found in the file</i><br />\n");
                    } else {
                        echo "$searchText could not be found in pdf files.<br>";
                    }
                } else {
                    // Any other file type: plain case-insensitive substring search
                    $dok = implode('', file($path . $file));
                    $count++;
                    if (stristr($dok, $searchText)) {
                        $hits++;
                        print("<a href=\"" . $path . $file . "\" target=\"_blank\">" . $file . "</a><br />The word <i>" . $searchText . " found in the file</i><br />\n");
                    }
                }
            }
        }
        closedir($dir);
        print($count . " file(s) searched, <strong>" . $searchText .
            "</strong> found in " . $hits . " document(s).<br />");
    }
}


And here's the class:

code:
<?

/**********************************************************************
**
** A class to search text in pdf documents.
** Not pretending to be useful other than that.
** But it can easily be extended to a full featured pdf document
** parser by anyone who chooses so.
**
** Author: Rene Kluwen / Chimit Software <rene.kluwen@chimit.nl>
**
** License: Public Domain
** Warranty: None
**
***********************************************************************/

class pdf_search {

    // Just one private variable.
    // It holds the document.
    var $_buffer;

    // Constructor. Takes the pdf document as only parameter
    function pdf_search($buffer) {
        $this->_buffer = $buffer;
    }

    // This function returns the next line from the document.
    // If a stream follows, it is inflated into readable text.
    // Lines are split on "\r" only, so "\r" or "\r\n" line endings are assumed.
    function nextline() {
        $pos = strpos($this->_buffer, "\r");
        if ($pos === false) {
            return false;
        }
        $line = substr($this->_buffer, 0, $pos);
        $this->_buffer = substr($this->_buffer, $pos + 1);
        if ($line == "stream") {
            $endpos = strpos($this->_buffer, "endstream");
            // Offset 1 skips the "\n" that is assumed to follow "stream\r"
            $stream = substr($this->_buffer, 1, $endpos - 1);
            $stream = @gzuncompress($stream);
            $this->_buffer = $stream . substr($this->_buffer, $endpos + 9);
        }
        return $line;
    }

    // This function returns the next line in the document that is printable text.
    // We need it so we can search in just that portion.
    function textline() {
        $line = $this->nextline();
        if ($line === false) {
            return false;
        }
        if (preg_match("/[^\\\\]\\((.+)[^\\\\]\\)/", $line, $match)) {
            // Turn octal escapes such as \101 into characters.
            // A callback replaces the /e modifier, which later PHP versions removed.
            $line = preg_replace_callback("/\\\\(\d+)/", "pdf_search_octal_char", $match[1]);
            return stripslashes($line);
        }
        return $this->textline();
    }

    // This function returns true or false, indicating whether the document contains
    // the text that is passed in $str.
    function textfound($str) {
        // Quote the search text so characters like "." or "(" are matched literally
        $pattern = "/" . preg_quote($str, "/") . "/i";
        while (($line = $this->textline()) !== false) {
            if (preg_match($pattern, $line) != 0) {
                return true;
            }
        }
        return false;
    }
}

// Callback for textline(): converts the digits of an octal escape to one character.
function pdf_search_octal_char($match) {
    return chr(octdec($match[1]));
}

?>


As the class is above, this line $stream = @gzuncompress($stream); returns an error and it obviously doesn't decompress the file.
Anyone here have any ideas?

I'm running this on w2k with Apache 1.3.20 and php 4.0.6, support by PDFlib GmbH Version 4.0.0 is enabled (but it says the beta has expired...) zlib 1.1.3 is enabled as well.

/Dan

{cell 260}
-{ a vibration is a movement that doesn't know which way to go }-
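
A note on the gzuncompress() error described above. One possible cause, offered as an assumption rather than a confirmed diagnosis, is that the compressed bytes get altered when the file is read on Windows: according to the PHP manual, file() only became binary safe in PHP 4.3.0, and text-mode reads on Windows can translate line endings. A second possible cause is that the stream boundaries are cut slightly wrong when the pdf uses different line endings than the class expects. The sketch below reads the file in binary mode and cuts each stream between "stream" and "endstream". The helper names readBinaryFile, extractFlateStreams and pdfStreamsContain are made up for illustration, it uses stripos() from newer PHP versions, and it only handles zlib-compressed (FlateDecode) streams in small files. A real parser would use the /Length entry of each stream instead of a regex.

```php
<?php
// Hypothetical helper: read a file byte for byte, with no text-mode translation.
function readBinaryFile($path)
{
    $handle = fopen($path, 'rb'); // the 'b' flag matters on Windows
    if ($handle === false) {
        return false;
    }
    $data = '';
    while (!feof($handle)) {
        $data .= fread($handle, 8192);
    }
    fclose($handle);
    return $data;
}

// Hypothetical helper: return the inflated contents of every zlib-compressed stream.
function extractFlateStreams($pdf)
{
    $streams = array();
    // Stream data starts after "stream" plus CRLF or LF and ends before "endstream".
    if (preg_match_all('/stream\r?\n(.*?)\r?\n?endstream/s', $pdf, $matches)) {
        foreach ($matches[1] as $raw) {
            $data = @gzuncompress($raw);
            if ($data !== false) { // skip streams that are not zlib data
                $streams[] = $data;
            }
        }
    }
    return $streams;
}

// Hypothetical helper: case-insensitive search across all inflated streams.
function pdfStreamsContain($pdf, $needle)
{
    foreach (extractFlateStreams($pdf) as $content) {
        if (stripos($content, $needle) !== false) {
            return true;
        }
    }
    return false;
}
```

In searchInFile() this would replace the implode('', file(...)) call for pdf files, for example pdfStreamsContain(readBinaryFile($path.$file), $searchText). Text that the pdf splits into several string fragments will still be missed by a plain substring search.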

InI
Paranoid (IV) Mad Scientist

From: Somewhere over the rainbow
Insane since: Mar 2001

posted 03-20-2003 16:26

The poster has demanded we remove all his contributions, lest he takes legal action.
We have done so.
Now Tyberius Prime expects him to start complaining that we removed his 'free speech' since this message will replace all of his posts, past and future.
Don't follow his example - seek real life help first.

DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-21-2003 10:03

Thanx InI!
I'll definitely look into the link for pdf4php; if it works, it's much better for me than having to rely on external tools.

The issue with the class was simply that it didn't seem to uncompress using "$stream = @gzuncompress($stream);" therefore it naturally was impossible to find the "$searchText" in the compressed data.

I just uploaded it on a Linux box and sure enough, it works there...
Grrr, I want it to work under winblows as well...

So now I have a different but related question:
"gzuncompress()": what does this use? zlib or something else?
Is this supported on a windows installation?

I'll try to upgrade my php installation to see if that helps.
/Dan
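
On the gzuncompress() question above: it is part of PHP's zlib extension, so it is only available when that extension is compiled in or loaded. A quick way to rule out the installation itself is a round trip through gzcompress() and gzuncompress() on a known string. This is a minimal sketch, and the function name zlibRoundTripWorks is made up for illustration. If it returns true, zlib works on that box and the problem is more likely in the data handed to gzuncompress().

```php
<?php
// Hypothetical check: is the zlib extension present, and does a
// compress/uncompress round trip give back the original string?
function zlibRoundTripWorks()
{
    if (!extension_loaded('zlib') || !function_exists('gzuncompress')) {
        return false;
    }
    $sample = str_repeat('pdf search test ', 20);
    return gzuncompress(gzcompress($sample)) === $sample;
}

var_dump(zlibRoundTripWorks());
```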


DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-21-2003 10:32

Ok, just upgraded to php 4.2.3 and it didn't help.
I'm still getting "Warning: gzuncompress: data error " on my w2k installation...


<edit>
Grrr... it seems like the class pdf4php uses gzcompress as well to compress the pdf-file...
Won't work for me then... and the test seems to prove it...
</edit>


[This message has been edited by DmS (edited 03-21-2003).]

InI
Paranoid (IV) Mad Scientist

From: Somewhere over the rainbow
Insane since: Mar 2001

posted 03-21-2003 10:56

The poster has demanded we remove all his contributions, lest he takes legal action.
We have done so.
Now Tyberius Prime expects him to start complaining that we removed his 'free speech' since this message will replace all of his posts, past and future.
Don't follow his example - seek real life help first.

DmS
Paranoid (IV) Inmate

From: Sthlm, Sweden
Insane since: Oct 2000

posted 03-21-2003 12:38

Thanx InI.
As I said it works on Linux so I'm ruling out illegal circular reference.
I found the bug-report and similar things, the sizeOf($stream) returns 1 and I've tried to add an int-value as second param, no luck...

I'll keep digging
/Dan
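
A side note on the sizeOf($stream) result above: sizeof() is an alias of count(), which counts array elements and in PHP 4 returns 1 for any non-array value, so a result of 1 says nothing about how much data the stream holds. strlen() gives the byte length of a string. The second parameter of gzuncompress() is only an upper limit on the decoded length, so it would not be expected to repair corrupted input. The sketch below uses a made-up helper, describeStream, to show two things worth checking before decompressing: the byte length and the first two bytes, which for zlib data typically start with hex 78.

```php
<?php
// Hypothetical helper: summarize a candidate stream before it goes to gzuncompress().
function describeStream($stream)
{
    return array(
        'bytes'  => strlen($stream),                // byte length of the string
        'header' => bin2hex(substr($stream, 0, 2)), // zlib data typically starts with 78
    );
}

$info = describeStream(gzcompress('BT (Hello PDF) Tj ET'));
echo $info['bytes'] . " bytes, header " . $info['header'] . "\n";
```

If the header of a stream taken from the pdf does not start with 78, the bytes being passed in are probably not the start of the compressed data.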

