Let me be clear upfront about what I mean by working with data. Generally, working with data means you have some form of data and you perform some kind of analysis on it. Once I have clean, structured data available, the analysis part becomes easy.<p>But what if the data is not clean and lives in files or databases? Then I have to read the files, clean them up, and structure them into data structures according to my needs (is THIS called parsing?). I am talking about this preprocessing part.<p>What CS or programming subjects should I study to become somewhat of an expert in cleaning, preprocessing, and structuring large amounts of files in batches?<p>I am also interested in the second part of the pipeline, where I analyse the data and produce output, both as good visualisations and as output data to be stored in files.<p>Any books/courses or other resource pointers would be appreciated.<p>P.S.: The files can be anything; they are just streams of bytes: images, audio, video, text, CSV.
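To make the preprocessing part concrete, here is a toy example of the kind of step I mean, in plain Python with only the standard library. The input text and the `name`/`amount` columns are made up for illustration; the point is the shape of the work: strip whitespace, skip blank or malformed rows, and coerce a text field into a typed value.

```python
import csv
import io

# Made-up messy input: inconsistent whitespace, an empty row,
# and a row with a non-numeric amount.
raw = """name, amount
 Alice ,  12.50
Bob,7
,
Carol, oops
"""

def clean_rows(text):
    """Parse CSV text into structured records, dropping rows we can't repair."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    records = []
    for row in reader:
        cells = [c.strip() for c in row]
        # Skip rows that are empty or don't match the header width.
        if len(cells) != len(header) or not any(cells):
            continue
        rec = dict(zip(header, cells))
        try:
            rec["amount"] = float(rec["amount"])
        except ValueError:
            continue  # drop rows with an unparseable number
        records.append(rec)
    return records

print(clean_rows(raw))
# → [{'name': 'Alice', 'amount': 12.5}, {'name': 'Bob', 'amount': 7.0}]
```

Real data is of course messier (encodings, nested formats, binary files), but even this tiny case shows the read / validate / structure loop I want to get good at.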
Others may disagree, but to me this is all part of data engineering, data science, data analytics, and data visualization. Google those terms, read up, and choose some free courses to get started. Machine learning and “AI” often get lumped into this area too.<p>There are plenty of useful free data sets out on the internet, and learning Python is useful.<p>I know AWS has a heap of services catering to data pipelines; maybe see if there’s a free tier on anything.<p>The fixing of bad data I’ve most commonly heard called “data cleansing” or “data scrubbing”.