An important component of the ACP system is text extraction from source documents. The various options for text extraction were to be explored, but with the scope limited to certain types of documents. (This constraint was set at the start of the project in view of the team size, workload and time constraints.)
Accessibility API
The prototype version of ACP, developed by previous developers, used the Accessibility API for text extraction. Using the Accessibility API has the following advantages and disadvantages:
| Advantages | Disadvantages |
|---|---|
| 1) The API is built into the Windows platform, so no third-party library is needed. | 1) It can only be used on Windows systems; on other operating systems the API may be different or may not exist at all. |
| 2) It is stable and can be called directly from a C# or Java program. | 2) The extracted text carries no structural information; everything is dumped as plain text (paragraphs, footers and headers are not differentiated). |
| 3) It can be easily extended to extract text from a variety of sources. | 3) Unwanted text that is clearly of no use to the user is sometimes extracted as well. |
Browser Plugin (Firefox)
Another option that was explored was extracting HTML from the browser. This could be done easily using an add-on (plugin) installed in the browser that communicates with the ACP system. The advantages and disadvantages of using a browser plugin are as follows:
| Advantages | Disadvantages |
|---|---|
| 1) It can extract HTML instead of plain text, which helps in identifying useful entities in the document (URLs, emails, etc.). | 1) The user is required to download and install the plugin in the browser. |
| 2) It allows only meaningful information to be extracted from web pages, ignoring unwanted entities such as advertisements and footers. | 2) The plugin can only extract browser content and cannot be extended to other source types. |
| 3) It can be used cross-platform, as it depends on the browser and not on the OS. | 3) People use different browsers, so a separate plugin is required for each of them. |
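
To illustrate the first two advantages, the following is a minimal sketch of how HTML handed over by a plugin could be processed once it reaches the ACP side. It is not the plugin's actual implementation. The names `PageExtractor`, `extract` and `SKIP_TAGS`, the choice of tags to ignore, and the simplified email pattern are all assumptions made for this example; it uses only Python's standard `html.parser` module.

```python
import re
from html.parser import HTMLParser

# Assumption: content inside these tags is usually of no use to the reader.
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}

# Assumption: a deliberately simple email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


class PageExtractor(HTMLParser):
    """Hypothetical helper: collects visible text, link URLs and emails."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside any tag listed in SKIP_TAGS
        self.text_parts = []
        self.urls = []
        self.emails = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
            return
        if tag == "a" and self._skip_depth == 0:
            href = dict(attrs).get("href")
            if not href:
                return
            if href.startswith("mailto:"):
                self.emails.append(href[len("mailto:"):])
            else:
                self.urls.append(href)

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth:
            return  # ignore text inside footers, scripts, etc.
        stripped = data.strip()
        if stripped:
            self.text_parts.append(stripped)
            self.emails.extend(EMAIL_RE.findall(stripped))


def extract(html):
    """Return the meaningful text plus the URLs and emails found in `html`."""
    parser = PageExtractor()
    parser.feed(html)
    parser.close()
    return {
        "text": " ".join(parser.text_parts),
        "urls": parser.urls,
        "emails": parser.emails,
    }
```

Because the markup is still available, the footer and script content can be dropped and the link targets recovered, neither of which is possible once everything has been flattened to plain text.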
Based on discussions within the team and consultation with the supervisor, we decided to explore the browser plugin option, as it has the added advantage of providing context and more meaningful text to the ACP. Since the Accessibility API had already been explored by the previous developer and was not a very accurate solution for our requirements, we concluded that we could explore something new that could open up future opportunities.