Very interesting!<p>>For most documents, we rely on Apache Tika to transform the original document into a canonical HTML representation, which then gets parsed in order to extract a list of “tokens” (i.e. words) and their “attributes” (i.e. formatting, position, etc…).<p>How good is Apache Tika at this, really? I've messed about with it, but it's hard to find solutions that cover even the basic cases.<p>What are the recommendations for covering, let's say, PDF, OpenXML, and ODF?