I need to extract certain information from research publications, such as species, biomass, geographic location, and possibly related environmental data. Assume that I will convert PDF to text and, if necessary, do OCR. But here's the catch: other species with similar data can appear very close to my target species, on the same page, in the same paragraph or sentence, or in the same table. Moreover, indicator values can be very close or identical (e.g., biomass B = 1.2 kg/m^2), since the species belong to the same genus. For example, Mytilus has three species (actually more): Mytilus edulis, Mytilus trossulus, and Mytilus galloprovincialis.
How would an algorithm with no prior knowledge determine that a specific value relates to my target species rather than, say, the one adjacent to it in the same table or paragraph? As a human, I know what to look for because I have prior knowledge, but I cannot process hundreds or thousands of articles as quickly as a machine can.
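To make the ambiguity concrete, here is roughly what a naive "attach the value to the nearest species mention" heuristic looks like, and where it breaks. The sentence, species list, and regex below are purely illustrative:

    import re

    # Illustrative sentence of the kind I mean; numbers and species are made up.
    sentence = ("Mean biomass was 1.2 kg/m^2 for Mytilus edulis and "
                "1.4 kg/m^2 for Mytilus trossulus at the same site.")

    SPECIES = ["Mytilus edulis", "Mytilus trossulus", "Mytilus galloprovincialis"]
    VALUE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*kg/m\^?2")

    def nearest_species(text, value_match, species_list):
        # Attach a value to the species mention with the smallest character
        # distance. This naive baseline works on the sentence above, but it
        # fails on e.g. "Biomass reached 1.2 and 1.4 kg/m^2 for Mytilus edulis
        # and Mytilus trossulus, respectively", where "respectively" pairs 1.4
        # with M. trossulus while the nearest mention is M. edulis, and it
        # says nothing about values sitting in table cells.
        candidates = []
        for sp in species_list:
            for m in re.finditer(re.escape(sp), text):
                candidates.append((abs(m.start() - value_match.start()), sp))
        return min(candidates)[1] if candidates else None

    for vm in VALUE_RE.finditer(sentence):
        print(vm.group(1), "->", nearest_species(sentence, vm, SPECIES))
    # prints: 1.2 -> Mytilus edulis, 1.4 -> Mytilus trossulus

This is exactly the kind of case where I don't know how a tool without prior knowledge would decide.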
Does anyone have experience in using a tool that can correctly parse such information after appropriate setup? I am aware of:

- HN search results (https://hn.algolia.com/?q=information+extraction)
- Apache Tika (https://tika.apache.org/)
- Apache OpenNLP (https://opennlp.apache.org/)
- Apache UIMA (https://uima.apache.org/external-resources.html)
- GATE (https://gate.ac.uk/)

But I am not sure if any of these can do the job, as I haven't used them. I also know that there are companies that have developed similar solutions (https://www.ontotext.com/knowledgehub/case-studies/ai-content-generation-in-scientific-communication/), possibly by using GraphDB.
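For the PDF-to-text step, I assume something like the tika-python wrapper around Apache Tika would do; a sketch only, I haven't run any of this at scale, and "paper.pdf" is a placeholder:

    # PDF-to-text via the tika-python wrapper (pip install tika); it starts a
    # local Tika server, so a Java runtime is required.
    from tika import parser

    parsed = parser.from_file("paper.pdf")
    text = parsed.get("content") or ""     # plain text; empty for image-only PDFs (then OCR)
    metadata = parsed.get("metadata", {})  # whatever the PDF exposes (title, authors, ...)

    print(len(text), list(metadata)[:5])

The harder part, attributing values to the right species afterwards, is what I have no solution for.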
In addition, what is the best data storage solution? In one case you extract a whole table from a publication; in another, only a single data point. It's not worth the effort of creating a separate table for a single data point.
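To illustrate what I mean, one option would be a single long-format "observations" table, so a whole extracted table becomes many rows and a lone data point becomes one row. The schema, column names, and sample row below are just a sketch:

    import sqlite3

    conn = sqlite3.connect("extractions.db")
    conn.execute("""
        CREATE TABLE IF NOT EXISTS observations (
            id       INTEGER PRIMARY KEY,
            source   TEXT,   -- DOI or filename of the publication
            species  TEXT,   -- e.g. 'Mytilus edulis'
            variable TEXT,   -- e.g. 'biomass'
            value    REAL,
            unit     TEXT,   -- e.g. 'kg/m^2'
            location TEXT,   -- place name or coordinates, if present
            context  TEXT    -- sentence or table cell the value came from
        )
    """)
    conn.execute(
        "INSERT INTO observations (source, species, variable, value, unit, location, context) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        ("example.pdf", "Mytilus edulis", "biomass", 1.2, "kg/m^2", None,
         "Mean biomass was 1.2 kg/m^2 for Mytilus edulis ..."),
    )
    conn.commit()
    conn.close()

Keeping the original sentence or cell in a context column would at least let a human audit the ambiguous cases later. But I don't know whether this is the right model, or whether a graph/document store fits better.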
What would be the right approach, software (library), workflow, and data storage solution in this case?