It’s much more limited in scope but fully open source and highly customisable. In fact it’s made for people to build their own pipelines on top of, providing the scaffolding needed to do so in a reliable way.
During development I’ve found it to be hard to truly generalise agent/llm-based data extraction, especially around the unlimited number of input types without task specific instructions (many files of the same kind, single large files, mixed kinds, bad quality files, docx/pdf/png/… the list goes on). Users sadly wanna upload all of these, and developers want a „one size fits all“ solution.
I am interested in how your solution deals with this. I came up with a strategy based approach so every task can be customised if needed, but I’d be delighted to see a technical writeup of how you deal with this endless variety of input + extraction task combos! :)
[−]gergelycsegzi · 2026-07-01 Wed 18:32 UTC ·
link
I'll need to check it out!
We had the same observation in that the possible space is almost endless, and for example even for the same file type there may be different kind of processing required (e.g. an excel can be database style, vs small narrative heavy, or both).
We have baked in some ground processing rules for different kinds of documents, and we do allow custom instructions on how to deal with specific cases (e.g. translations, particular format layouts). The best write-up I have at the moment is https://www.parsewise.ai/doc-processing-pipelines but we're working on something that goes into more detail:)
It’s much more limited in scope but fully open source and highly customisable. In fact it’s made for people to build their own pipelines on top of, providing the scaffolding needed to do so in a reliable way.
During development I’ve found it to be hard to truly generalise agent/llm-based data extraction, especially around the unlimited number of input types without task specific instructions (many files of the same kind, single large files, mixed kinds, bad quality files, docx/pdf/png/… the list goes on). Users sadly wanna upload all of these, and developers want a „one size fits all“ solution.
I am interested in how your solution deals with this. I came up with a strategy based approach so every task can be customised if needed, but I’d be delighted to see a technical writeup of how you deal with this endless variety of input + extraction task combos! :)
We had the same observation in that the possible space is almost endless, and for example even for the same file type there may be different kind of processing required (e.g. an excel can be database style, vs small narrative heavy, or both).
We have baked in some ground processing rules for different kinds of documents, and we do allow custom instructions on how to deal with specific cases (e.g. translations, particular format layouts). The best write-up I have at the moment is https://www.parsewise.ai/doc-processing-pipelines but we're working on something that goes into more detail:)