This is a major part of my job - we're a small team that works with close to 100 partners in the nonprofit sector, who all store similar data differently in CSVs, PDFs, 2-3 industry-specific CRMs, etc. that we need to standardize and load. Our partners have small datasets (usually <2000 rows, maybe 20 columns or so) that are, most of the time, extremely poorly formatted. We work in a single domain, so each partner's data is largely a differently formatted version of the same common elements.

We have a template with accompanying documentation - partners with technical teams and well-structured data can basically self-serve with SQL. We have a series of meetings where we map columns and values, review errors, etc. More irritating than the data transformations is understanding the structure of the data and its practical use, e.g., the same column for Partner A means something entirely different for Partner B. Transforming it is often not the problem - making sure everyone across our teams understands what it means and where it should go is a large pain point in mapping, but the coding of this logic is trivial.

For non-technical partners where we handle the prep, I've written a small internal Python library over time, built on DataFrames, that contains a few hundred tests to validate the data, plus a set of commonly used data cleaning functions that are flexible in their inputs. We connect to APIs to verify addresses, time zones, etc., where we can. We now use LLMs more frequently to parse and structure fields to our standard, but it is still painful to review the LLM's results and ensure correctness. Each incoming file gets an accompanying Jupyter notebook that runs all the relevant scripts.

The issue with working in Excel (formulas, Office JS/PY scripts, manual edits) has always been version tracking - it's difficult to replicate work if new versions of files come in while we are prepping the data. If we find an error post-import, it is great to be able to track down where and why we made the decision or coding error. I haven't tried Power Query, though. I have tried OpenRefine, but I think it sometimes slows me down for easy functions, and API-based data transformations become a separate task.

When we have interns, coordinating cleaning / prep on a single file across users can be tricky.

We did try an internal POC of a UI-based tool to allow partners to map their own data and start cleaning it, but honestly, either a) the partner would have a data analyst or steward who didn't need it, or b) the partner wouldn't have enough of a skill set to feel comfortable cleaning it. Plus, pretty often we'll need to join data from multiple spreadsheets but conditionally use certain rows from one or the other to get the most up-to-date data, which can be difficult to do (a rough pandas sketch of that pattern is below). It didn't feel worth the effort to continue with.

Fuzzy de-duplication validation is a pain, as is anything where you actually want to check correctness with the partner - like when I want to confirm an email is spelled correctly because it almost matches the person's last name but is one letter off (also sketched below) - it all becomes a long list of bullet points in an email. Something I would like is an easy way to flag and share errors, have a partner correct them in a friendly UI, and then store those changes as code, without a long series of emails or meetings.
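For the conditional-join case, the pattern is roughly this in pandas - the file and column names (constituent_id, last_updated) are invented for illustration, and the real condition is rarely this clean:

    import pandas as pd

    # Two versions of the "same" records from a partner; keep whichever
    # row was updated most recently. File/column names are illustrative.
    old = pd.read_excel("partner_export_march.xlsx")
    new = pd.read_excel("partner_export_june.xlsx")

    combined = pd.concat([old, new], ignore_index=True)
    combined["last_updated"] = pd.to_datetime(combined["last_updated"], errors="coerce")

    # Sort so undated rows lose to dated ones, then keep the most recent
    # row per constituent.
    latest = (combined.sort_values("last_updated", na_position="first")
                      .drop_duplicates(subset="constituent_id", keep="last"))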
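And the email-vs-last-name flag is basically a similarity ratio with a threshold - a minimal stdlib sketch, not our actual check, with column names and the threshold as placeholders:

    import difflib
    import pandas as pd

    def flag_email_near_misses(df, email_col="email", name_col="last_name",
                               threshold=0.8):
        # Flag rows where the email's local part is suspiciously close to,
        # but not identical to, the person's last name.
        flagged = []
        for idx, row in df.iterrows():
            local = str(row[email_col]).split("@")[0].lower()
            last = str(row[name_col]).lower()
            ratio = difflib.SequenceMatcher(None, local, last).ratio()
            if threshold <= ratio < 1.0:
                flagged.append({"row": idx, "email": row[email_col],
                                "last_name": row[name_col],
                                "similarity": round(ratio, 2)})
        return pd.DataFrame(flagged)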
A few times we've uploaded an Excel file to SharePoint with a script that automatically adds a comment to cells with errors (roughly the sketch at the end of this comment) - but again, some people just aren't comfortable working with Excel, or get confused when they see a very structured version of their data with additional formatting, and there is no convenient way to track or replicate their changes.

There are always one-off corrections that need to be made as well - I've just sort of accepted this, so we generate a list of warnings / recommendations for the partner to review or correct post-import based on the test results, rather than us trying to do it. That has worked fine.
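For what it's worth, the cell-commenting script mentioned above is nothing fancy - roughly this with openpyxl, assuming the validation step has already produced (row, column, message) tuples; the function name and error format are made up for illustration:

    from openpyxl import load_workbook
    from openpyxl.comments import Comment

    def annotate_errors(path, errors, sheet=0):
        # errors: iterable of (row, column, message), 1-indexed like Excel
        wb = load_workbook(path)
        ws = wb.worksheets[sheet] if isinstance(sheet, int) else wb[sheet]
        for row, col, message in errors:
            ws.cell(row=row, column=col).comment = Comment(message, "data-prep")
        wb.save(path)

    # annotate_errors("partner_file.xlsx", [(12, 4, "Email looks misspelled")])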