(Also at the GN)
Hi there all!
I'm building a little thing in PHP that lets the user create a file archive, add categories, set which file types are allowed for upload, and upload files along with a short description of each file. All info except the actual file is stored in MySQL; the files themselves live in the filesystem.
Then, of course, you should be able to search the archives.
All of this is done, including searching within an archive, with or without categories, plus free-text search in the descriptions stored in the database.
However, I want to be able to search inside the files as well...
I've got it working so I can search inside .txt, .html, .doc but I'm stumped trying to search inside a pdf-file.
I have searched quite a lot by now, and while this question comes up all over the net, every solution points to "xpdf", "ghostscript" and other "external" tools.
I'd really like this to be as independent of the server-installed things as possible, meaning pure php.
Here is the code that searches inside the files. It works for everything except pdf:
(For pdf it uses a class I found here http://www.phpclasses.org/browse.html/package/702.html during my searches, but it doesn't work...)
code:
function searchInFile($selArch,$searchText){
    $baseDir = "archives/";
    if(($selArch !="")&&($searchText !="")){
        // "Searching files in the whole archive..."
        print("<strong>Söker i filer i hela arkivet...</strong><br>");
        $archiveName = getArchiveName($selArch);
        $path = $baseDir.$archiveName."/";
        $dir = opendir($path);
        $count = 0;
        $hits = 0;
        // Compare with !== so that a file named "0" does not end the loop early
        while (($file = readdir($dir)) !== false){
            if ($file != "." && $file != ".."){
                // Lower-case the extension so that ".PDF" is matched as well
                $ext = strtolower(substr($file,-4));
                //print($ext."<br>");
                if($ext == ".pdf"){
                    // "Searching in pdf..."
                    print("<br><strong>Söker i pdf...</strong><br>");
                    $content = implode('',file($path.$file));
                    // Allocate class instance
                    $pdf = new pdf_search($content);
                    // And do the search
                    if ($pdf->textfound($searchText)) {
                        //echo "We found $searchText in $path$file<br>";
                        $hits++;
                        // "The word ... found in the file"
                        print("<a href=\"".$path.$file."\" target=\"_blank\">".$file."</a><br />Ordet <i>".$searchText." funnet i filen</i><br />\n");
                    }else{
                        // "... could not be found in pdf files."
                        echo "$searchText kunde ej hittas i pdffiler.<br>";
                    }
                }else{
                    // Plain search: read the whole file and look for the text, ignoring case
                    $dok = implode('',file($path.$file));
                    $count++;
                    if(stristr($dok,$searchText)){
                        $hits++;
                        print("<a href=\"".$path.$file."\" target=\"_blank\">".$file."</a><br />Ordet <i>".$searchText." funnet i filen</i><br />\n");
                    }
                }
            }
        }
        closedir($dir);
        // "N file(s) searched, ... found in M documents."
        print($count." fil(er) genomsökt(a), <strong>".$searchText.
            "</strong> funnet i ".$hits." dokument.<br />");
    }
}
And here's the class:
code:
<?
/**********************************************************************
**
** A class to search text in pdf documents.
** Not pretending to be useful other than that.
** But it can easily be extended to a full featured pdf document
** parser by anyone who chooses so.
**
** Author: Rene Kluwen / Chimit Software <rene.kluwen@chimit.nl>
**
** License: Public Domain
** Warranty: None
**
***********************************************************************/
class pdf_search {

    // Just one private variable.
    // It holds the document.
    var $_buffer;

    // Constructor. Takes the pdf document as only parameter
    function pdf_search($buffer) {
        $this->_buffer = $buffer;
    }

    // This function returns the next line from the document.
    // If a stream follows, it is deflated into readable text.
    function nextline() {
        $pos = strpos($this->_buffer, "\r");
        if ($pos === false) {
            return false;
        }
        $line = substr($this->_buffer, 0, $pos);
        $this->_buffer = substr($this->_buffer, $pos + 1);
        if ($line == "stream") {
            $endpos = strpos($this->_buffer, "endstream");
            $stream = substr($this->_buffer, 1, $endpos - 1);
            $stream = @gzuncompress($stream);
            $this->_buffer = $stream . substr($this->_buffer, $endpos + 9);
        }
        return $line;
    }

    // This function returns the next line in the document that is printable text.
    // We need it so we can search in just that portion.
    function textline() {
        $line = $this->nextline();
        if ($line === false) {
            return false;
        }
        if (preg_match("/[^\\\\]\\((.+)[^\\\\]\\)/", $line, $match)) {
            $line = preg_replace("/\\\\(\d+)/e", "chr(0\\1);", $match[1]);
            return stripslashes($line);
        }
        return $this->textline();
    }

    // This function returns true or false, indicating whether the document contains
    // the text that is passed in $str.
    function textfound($str) {
        while (($line = $this->textline()) !== false) {
            if (preg_match("/$str/i", $line) != 0) {
                return true;
            }
        }
        return false;
    }
}
?>
With the class as shown above, the line $stream = @gzuncompress($stream); returns an error, so the stream is obviously not being decompressed.
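One thing I haven't been able to rule out: the class looks for "\r" only when it splits lines, and it assumes the stream data starts exactly one byte after that, while a pdf may end its lines with "\n" or "\r\n". Below is a minimal sketch of how the stream bodies could be cut out without depending on the line-ending style before they are handed to gzuncompress(). The function name extractPdfStreams is made up for this illustration and is not part of the pdf_search class. The sketch also assumes the streams are Flate-compressed (zlib) and simply skips any stream that gzuncompress() cannot decode.

```php
<?php
// Hypothetical helper, not part of the pdf_search class.
// Returns an array with the inflated contents of every stream in
// $buffer that gzuncompress() is able to decode.
function extractPdfStreams($buffer) {
    $texts = array();
    $offset = 0;
    while (($start = strpos($buffer, "stream", $offset)) !== false) {
        $start += 6; // skip the keyword "stream"
        // The keyword is followed by "\r\n" or "\n" before the data begins.
        if (substr($buffer, $start, 2) == "\r\n") {
            $start += 2;
        } elseif (substr($buffer, $start, 1) == "\n") {
            $start += 1;
        }
        $end = strpos($buffer, "endstream", $start);
        if ($end === false) {
            break; // truncated file, nothing more to read
        }
        $raw = substr($buffer, $start, $end - $start);
        // Assumption: the stream uses FlateDecode (zlib). Other filters,
        // or unfiltered streams, make gzuncompress() return false.
        $plain = @gzuncompress($raw);
        if ($plain === false) {
            // Retry without the line break that may precede "endstream".
            $plain = @gzuncompress(rtrim($raw, "\r\n"));
        }
        if ($plain !== false) {
            $texts[] = $plain;
        }
        $offset = $end + 9; // continue after "endstream"
    }
    return $texts;
}
```

Each string that comes back could then be run through the same kind of text matching that textline() does. Whether this is enough for real-world pdf files I can't say.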
Anyone here have any ideas?
I'm running this on Windows 2000 with Apache 1.3.20 and PHP 4.0.6. PDF support (PDFlib GmbH Version 4.0.0) is enabled (but it says the beta has expired...), and zlib 1.1.3 is enabled as well.
/Dan