PHP Classes

How to Make a PHP PDF Search Engine, as well as Read DOCX, DOC or PDF Document Files or Content Strings and Convert Them to Text


Categories: PHP Tutorials

Over the last few decades, the massive digitization of processes has made companies and individuals produce a lot of rich text documents in the DOCX, DOC and PDF formats.

This creates a problem: to search these documents, we first need to get at the text they contain, and that text is not stored as plain text inside the files.

Read this article to learn how to solve the problem of searching and indexing these documents using a PHP class that can easily extract the text contents.




By Ash Kiswany Palestine

Contents

Introduction

Extracting the Text from Document Files

Searching the Document Text

Conclusion


Introduction

If we want to search DOCX, DOC and PDF files, we first need to extract the text they contain. We can then either save that text to a database and let the database server perform the searches using query parameters, or search the text directly with PHP code.

One solution for extracting the text from these kinds of documents is the PHP DOC DOCX PDF to Text class. This class can extract text from PDF files as well as Microsoft Word files, including the older versions that use a proprietary binary file format.

Extracting the Text from Document Files

With this example I will show you how easy it is to convert a document to plain text:

<?php
 require( "class.filetotext.php" );

 $docObj = new Filetotext( "test.docx" );
 $return = $docObj -> convertToText();

 print $return;

As you can see, we include the class file and then create a new Filetotext object, which takes the file path as its parameter. Then we call the convertToText() method on the object, which returns the converted text.

Here are two more examples. The code is basically the same for any document in a supported format.

<?php
 require( "class.filetotext.php" );

 $docObj = new Filetotext( "test1.doc" );
 $return = $docObj -> convertToText();

 print $return;

<?php
 require( "class.filetotext.php" );

 $docObj = new Filetotext( "test2.pdf" );
 $return = $docObj -> convertToText();

 print $return;
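
To give an idea of what such a conversion involves, here is a minimal sketch for the DOCX format only. A DOCX file is a ZIP archive whose main text lives in the entry word/document.xml. The function name docxToPlainText is made up for this illustration, and the code is not taken from the Filetotext class, which may work differently. It assumes that PHP's zip extension is installed.

```php
<?php
// Illustrative sketch only: this is NOT the code of the Filetotext class.
// Assumes PHP's zip extension (ZipArchive) is available.

/**
 * Returns the plain text of a DOCX file, or null if it cannot be read.
 * docxToPlainText() is a hypothetical helper name.
 */
function docxToPlainText(string $path): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null; // missing file or not a readable ZIP archive
    }

    // The main document body is stored in this archive entry.
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();

    if ($xml === false) {
        return null; // the archive has no document body
    }

    // Turn paragraph ends into line breaks before dropping the tags.
    $xml = str_replace('</w:p>', "\n", $xml);

    // Remove the XML tags and decode entities such as &amp;.
    $text = html_entity_decode(strip_tags($xml), ENT_QUOTES | ENT_XML1, 'UTF-8');

    return trim($text);
}
```

Calling docxToPlainText( "test.docx" ) would return the document text with one line per paragraph. The older DOC format and PDF files are much harder to parse by hand, which is where a ready made class helps.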

Searching the Document Text

With this class it is a piece of cake to convert any DOCX, DOC or PDF file to plain text. The resulting text may not be suitably formatted for display to users, but it works well for searching purposes.

In some cases we can also modify the text and then save it back to a document file of the original format, using another class that can generate documents in the format we want.

Anyway, for searching purposes the text can be stored in any database, so we can search multiple documents with a single query. If you use a database like MySQL, you can store the text in a column with an associated full text index. InnoDB tables support full text indexes since MySQL 5.6. Keep in mind that a TEXT column holds at most 65,535 bytes, so longer documents need a MEDIUMTEXT or LONGTEXT column.

CREATE TABLE documents (
 id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
 filename VARCHAR(255),
 contents TEXT(65535),
 FULLTEXT search_index (contents)
) ENGINE=InnoDB;

Then we can use the MATCH expression in SELECT queries to search for given text inside the document contents.

SELECT id, filename FROM documents WHERE MATCH (contents) AGAINST ('search keywords here' IN NATURAL LANGUAGE MODE);
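
From PHP, the same statements could be run through PDO with prepared statements, so that file names and search keywords are never concatenated into the SQL. The following is only a sketch: the function names indexDocument and searchDocuments are made up for this example, and it assumes the documents table above already exists and that the caller provides a working PDO connection.

```php
<?php
// Sketch under assumptions: the documents table shown above already exists
// and $pdo is a PDO connection created by the caller.
// indexDocument() and searchDocuments() are hypothetical helper names.

/** Stores the extracted text of one document and returns its new id. */
function indexDocument(PDO $pdo, string $filename, string $contents): int
{
    $stmt = $pdo->prepare(
        'INSERT INTO documents (filename, contents) VALUES (:filename, :contents)'
    );
    $stmt->execute([':filename' => $filename, ':contents' => $contents]);

    return (int) $pdo->lastInsertId();
}

/** Returns the id and filename of each document matching the keywords (MySQL only). */
function searchDocuments(PDO $pdo, string $keywords): array
{
    $stmt = $pdo->prepare(
        'SELECT id, filename FROM documents
         WHERE MATCH (contents) AGAINST (:keywords IN NATURAL LANGUAGE MODE)'
    );
    $stmt->execute([':keywords' => $keywords]);

    return $stmt->fetchAll(PDO::FETCH_ASSOC);
}
```

You would call indexDocument( $pdo, "test.docx", $docObj -> convertToText() ) once per file, and then searchDocuments( $pdo, 'search keywords here' ) for each search. The connection itself would be created with something like new PDO( 'mysql:host=localhost;dbname=library;charset=utf8mb4', $user, $password ), where the host, database name and credentials are placeholders.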

If you just want to search the text of a single document, you can use, for instance, the PHP strpos function to check whether a given phrase appears in the text. Note that strpos is case sensitive and only matches the exact string.

if( strpos( $contents, 'search keywords here') !== false)
{
 echo 'Keywords found!';
}
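
Because strpos looks for the whole string exactly as typed, a search for several keywords usually needs a little more code. Here is a minimal sketch that checks whether every keyword occurs somewhere in the text, ignoring letter case. The function name containsAllKeywords is made up for this example. It uses stripos, which only ignores case reliably for ASCII letters; for other alphabets mb_stripos from the mbstring extension is the usual replacement.

```php
<?php
/**
 * Returns true when every keyword of the query occurs somewhere in the text,
 * ignoring letter case. Keywords are separated by whitespace.
 * containsAllKeywords() is a hypothetical helper name.
 */
function containsAllKeywords(string $contents, string $query): bool
{
    // Split the query into individual keywords, skipping empty pieces.
    $keywords = preg_split('/\s+/', trim($query), -1, PREG_SPLIT_NO_EMPTY);

    if ($keywords === []) {
        return false; // an empty query matches nothing
    }

    foreach ($keywords as $keyword) {
        if (stripos($contents, $keyword) === false) {
            return false; // one missing keyword is enough to fail
        }
    }

    return true;
}
```

With this function, the condition of the previous example could become containsAllKeywords( $contents, 'search keywords here' ), which also matches when the keywords appear in a different order or far apart in the document.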

Conclusion

The PHP DOC DOCX PDF to Text class can make documents in these formats easy to search. It is a very good tool for building a large online database of books, where the extracted text of these documents can also be crawled by search engines.

If you liked this article or have questions about extracting and searching text from document files, post a comment here.









