An important component of the ACP system is text extraction from source documents. Various options for text extraction were to be explored, with the scope limited to certain types of documents. (This constraint was set at the start of the project in view of the team size, workload and time constraints.)
Accessibility API
The prototype version of ACP, as developed by the previous developers, used the Accessibility API for text extraction. Using the Accessibility API has the following advantages and disadvantages:
| Advantages | Disadvantages |
|---|---|
| 1) The API is built in and provided by the Windows platform, so no third-party library is needed. | 1) It can only be used on Windows systems; on other operating systems the API may be different or may not exist at all. |
| 2) It is stable and can be called directly from a C# or Java program. | 2) The extracted text carries no structure information; everything is dumped as plain text, with no distinction between paragraphs, headers and footers. |
| 3) It can easily be extended to extract text from a variety of sources. | 3) Unwanted text that is clearly of no use to the user is sometimes extracted as well. |
Browser Plugin (Firefox)
Another option explored was extracting HTML from the browser. This can be done easily with an add-on (plugin) installed in the browser that communicates with the ACP system. The advantages and disadvantages of using a browser plugin are as follows:
...
Based on discussions within the team and consultation with the supervisor, we decided to explore the browser plugin option, as it has the added advantage of providing context and more meaningful text to the ACP. Since the Accessibility API had already been explored by the previous developer and was not a very accurate solution for our requirements, we concluded that exploring something new could open up future opportunities.
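To illustrate what "context and more meaningful text" could mean in practice, the following is a minimal sketch in Python of how HTML handed over by a browser plugin might be turned into structured text. It is not the ACP implementation: the class `StructuredTextExtractor`, the function `extract_blocks`, and the choice of which tags to keep or skip are all assumptions made for this example. It uses only the standard-library `html.parser` module.

```python
from html.parser import HTMLParser

# Assumptions for this sketch: which tags count as content blocks
# and which subtrees are treated as unwanted text and skipped.
BLOCK_TAGS = {"h1", "h2", "h3", "p", "li"}
SKIP_TAGS = {"script", "style", "nav", "footer"}


class StructuredTextExtractor(HTMLParser):
    """Collects (tag, text) pairs so headings and paragraphs stay distinct."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # extracted (tag, text) pairs, in document order
        self._current = None   # block tag currently being read, if any
        self._buffer = []      # text fragments of the current block
        self._skip_depth = 0   # greater than 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS and self._skip_depth == 0:
            self._current = tag
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag == self._current:
            # Join fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.blocks.append((tag, text))
            self._current = None

    def handle_data(self, data):
        if self._current is not None and self._skip_depth == 0:
            self._buffer.append(data)


def extract_blocks(html):
    """Return the list of (tag, text) blocks found in an HTML string."""
    parser = StructuredTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Calling `extract_blocks` on a page returns each heading, paragraph and list item together with its tag, while navigation and footer text is dropped. This is the kind of structure that a plain-text dump does not provide.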