Tropes, Semantic Text Analysis - Online Reference Manual

CHAPTER 5 - Appendixes

File conversions

Observations about file format conversions:

Format

Extension

Description

ASCII

*.ASC

The software uses the Windows API function OemToChar to convert this format to ANSI ISO-8859-1.

HTML

*.HTM

For the HTML (Hypertext Markup Language) file format, see the XML remarks below.

Macintosh text files

*.MAC

The Apple Macintosh Latin character sets are converted to ANSI ISO-8859-1.

Microsoft PowerPoint

*.PPT
*.PPTX

The software performs a native conversion of Microsoft PowerPoint files using the IFilter drivers on your system. Installing Microsoft Office (or the equivalent 32-bit IFilter pack) is a prerequisite for these documents.

Microsoft Word

*.DOC
*.DOCX

The software converts Microsoft Word files using the 32-bit IFilter drivers on your system. In addition, for problematic documents, the software can extract the text of Word 97-2003 files by binary analysis. Installing Microsoft Office (or the equivalent 32-bit IFilter pack) is a prerequisite for DOCX documents.

OpenOffice documents

*.ODT
*.ODP

The software performs a native conversion of OpenDocument Text or Presentation files using the IFilter drivers on your system. Installing OpenOffice (or the equivalent 32-bit IFilter pack) is a prerequisite for these documents.

PDF

*.PDF

The software uses a specific IFilter driver to extract the relevant text from PDF (Portable Document Format) files. Adobe Reader itself does not need to be installed, but a PDF IFilter driver must be installed to use this file format. External optical character recognition (OCR) software may be necessary for some files (those produced by scanning).

RTF

*.RTF

The software interprets RTF (Rich Text Format) files containing characters in ANSI ISO-8859-1, Apple Macintosh, or IBM ASCII, coded on 7 or 8 bits. Unicode is not accepted. Because RTF standards differ, stray characters can appear in certain files.

SGML

*.SGM

The software discards SGML (Standard Generalized Markup Language, ISO 8879) tags and converts some HTML-specific character entities (such as accented characters). It does not interpret the DTD. UTF-8 (UCS Transformation Format 8, ISO 10646 / RFC 2279) is automatically converted to ANSI ISO-8859-1. Other Unicode formats (Universal Character Set, ISO 10646) are not directly accepted.

Text

*.TXT

No conversion is applied to this file format, which is assumed to be in ANSI ISO-8859-1 (a.k.a. ISO-LATIN1, or ANSI Windows) or, if a Byte Order Mark is present at the start of the file, in Unicode UTF-8.

XML

*.XML

The software uses its SGML parser (see above) to read XML (Extensible Markup Language) files. The same limitations therefore apply to this format. The engine does not interpret scripts or style sheets.
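The text-level conversions in the table above can be illustrated with a short Python sketch. This is not the Tropes implementation: the function names, the choice of code page 437 for the OEM character set (the actual OEM code page depends on the Windows locale), the "try UTF-8 first" detection for markup files, and the simple regular expression used to drop tags are all assumptions made for illustration.

```python
import html
import re

# Assumption: OEM code page 437. On a real Windows system the OEM code page
# used by OemToChar depends on the system locale.
OEM_CODEC = "cp437"
UTF8_BOM = b"\xef\xbb\xbf"

# Simplistic tag matcher (assumption): it ignores comments, CDATA and the DTD.
TAG_RE = re.compile(r"<[^>]*>")


def decode_text_file(raw: bytes, extension: str) -> str:
    """Decode raw file bytes according to the rules sketched in the table."""
    ext = extension.lower().lstrip("*.")
    if ext == "asc":
        # OEM (IBM ASCII) text.
        return raw.decode(OEM_CODEC)
    if ext == "mac":
        # Apple Macintosh Latin character set.
        return raw.decode("mac_roman")
    if ext == "txt":
        # UTF-8 only when a Byte Order Mark is present, otherwise ISO-8859-1.
        if raw.startswith(UTF8_BOM):
            return raw[len(UTF8_BOM):].decode("utf-8")
        return raw.decode("latin-1")
    if ext in {"sgm", "xml", "htm"}:
        # Assumption: treat the bytes as UTF-8 when they decode cleanly.
        try:
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            text = raw.decode("latin-1")
        # Drop tags first, then convert entities such as &eacute;.
        return html.unescape(TAG_RE.sub(" ", text))
    raise ValueError(f"unsupported extension: {extension!r}")


def to_latin1(text: str) -> bytes:
    """Encode to ISO-8859-1, replacing characters that have no equivalent."""
    return text.encode("latin-1", errors="replace")
```

For example, `decode_text_file(b"<p>caf&eacute;</p>", "*.SGM")` yields the word "café" surrounded by spaces, and `to_latin1` then stores the accented letter as the single byte 0xE9. Characters outside ISO-8859-1 are replaced by a question mark in this sketch.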

Tropes shows the status of the filters currently installed in your system in the [Information] box of the [General options] dialog.

Tropes uses the IFilters already installed on your system. If necessary, you can download the latest 32-bit version of the Microsoft Filter Pack from Microsoft's web site. For more information about IFilter technology and error codes, read the Microsoft documentation (http://msdn.microsoft.com/en-us/library/ms691105%28VS.85%29.aspx).

Note that the PDF IFilter is usually automatically installed with Adobe Reader.

The software uses an internal component to convert Microsoft Word 97/2003 files. If a problem occurs (corrupted file, etc.), conversion automatically falls back to binary (heuristic) analysis, which attempts to recover the text directly from the file's bytes. If you notice that the software hangs on certain Word files, you can deactivate the conversion in the Analysis options dialog ([Conversions] tab, located in the indexing parameters if you are using Zoom).
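The fallback behaviour described above can be sketched as follows. The function and parameter names (`convert_with_fallback`, `primary`, `heuristic`, `use_primary`) are hypothetical and do not correspond to any Tropes API; the sketch only shows the control flow of trying a converter and switching to a heuristic when it fails.

```python
from typing import Callable

# A converter takes the raw bytes of a document and returns its text.
Converter = Callable[[bytes], str]


def convert_with_fallback(
    raw: bytes,
    primary: Converter,
    heuristic: Converter,
    use_primary: bool = True,
) -> str:
    """Try the primary converter; fall back to the heuristic if it fails.

    `use_primary=False` mimics deactivating the conversion in the options,
    so that only the heuristic analysis is used.
    """
    if use_primary:
        try:
            return primary(raw)
        except Exception:
            # Corrupted file or unsupported variant: fall through.
            pass
    return heuristic(raw)
```

Note that an exception handler only covers converters that fail cleanly. A converter that hangs never raises, which is presumably why an option to deactivate the conversion is useful.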

In all cases, password-protected files cannot be read or converted directly by the software. You must convert these files manually before analyzing them.

When the software performs a binary (heuristic) analysis to extract the text, stray characters may appear in the displayed text. By design, this method of text extraction cannot be perfect, because the software does not take the native file format into account.
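To see why such a method produces stray characters, here is a minimal sketch of one possible heuristic: keeping runs of bytes that are printable in ISO-8859-1. This is not necessarily the method Tropes uses; the function name and the minimum run length are assumptions.

```python
import re


def extract_text_heuristic(raw: bytes, min_run: int = 4) -> str:
    """Return runs of at least `min_run` printable ISO-8859-1 bytes.

    The file structure is ignored entirely, so binary data that happens to
    look like text is kept, and text stored in a 16-bit encoding is missed.
    """
    # Printable ASCII (0x20-0x7E) plus the upper Latin-1 range (0xA0-0xFF).
    pattern = re.compile(rb"[\x20-\x7e\xa0-\xff]{%d,}" % min_run)
    runs = pattern.findall(raw)
    return "\n".join(run.decode("latin-1") for run in runs)
```

Short runs are dropped to reduce noise, but any sequence of control data that happens to fall within the printable ranges still ends up in the output, which matches the limitation described above.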




Copyright Acetic and Semantic Knowledge, all rights reserved
www.semantic-knowledge.com