TechEcho

Ask HN: I am a data analyst and my code is a mess

5 points by elliott34 over 10 years ago
I have been wondering for a while whether I'd get laughed at for asking this question, but it's gotten to the point where I really need some guidance.

I have a spaghetti code problem. I am a data scientist/analyst, and my day-to-day is entirely in Python/scikit-learn/pandas: data munging and running models. Right now my code is several hundred lines of data processing steps, filtering, lots and lots of joins and SQL queries, pickle dumps and loads, and print array.shape calls. I try to create as many functions as possible to help organize the code, and I put different parts of the project into different scripts. I use IPython Notebook in the cloud for the interactive portion of my analysis, and Sublime Text 2 for the fixed data processing scripts.

Long story short, I have a physics background and was never taught how to properly structure my workflow for this type of coding. Should I be creating more classes and objects?

Are there any resources out there on how to code and structure large machine learning projects like this? Or is it doomed to be spaghetti code?

5 comments

valarauca1 over 10 years ago
The rule of thumb that most people stick with when doing OOP is that duplicate code is bad.

The goal is to find data that needs to be grouped, and group it. Find functions that only use that grouped data, and stick them in classes.

For example, a query can be an object, e.g. a database connection (in Java):

```java
public class DBConnect {
    // Connection, mkConnection and ExecQuery are placeholders for your database client
    private Connection connection = null;

    public DBConnect(String ip, int port) {
        this.connection = mkConnection(ip, port);
    }

    public Object query(String query) {
        return this.connection.ExecQuery(query);
    }
}
```

Then your query-specific pre-processing code can be added directly to the query method:

```java
public String query(String query, String regex) {
    // remove everything matching regex from the raw result
    return this.connection.ExecQuery(query).replaceAll(regex, "");
}
```

Which results in code like:

```java
DBConnect db = new DBConnect("127.0.0.1", 150);
String[] queries = { "yada", "yada", "yada" };
for (String str : queries) {
    String result = db.query(str, "\\s+");
    doDataScience(result);
}
```

I don't know if this helps, but it's a suggestion.

P.S.: I've been spending my free nights the past 2 weeks trying to throw together a JavaScript-based data processing engine in Java. It should be mostly workable by the weekend. I could throw it on a Show HN if you'd be interested.
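
Since the original question is about Python, here is a rough Python counterpart to the Java sketch above. It is only an illustration: the `QueryRunner` class, the `events` table, and the choice of the standard-library `sqlite3` module with an in-memory database are assumptions made for the example, not anything from the thread.

```python
import re
import sqlite3


class QueryRunner:
    """Groups a database connection with the helpers that use it (hypothetical name)."""

    def __init__(self, database=":memory:"):
        # One connection, opened once and shared by every query.
        self.connection = sqlite3.connect(database)

    def query(self, sql, params=()):
        """Run a query and return all rows as a list of tuples."""
        return self.connection.execute(sql, params).fetchall()

    def query_clean(self, sql, pattern, params=()):
        """Run a query and remove regex matches from every string value."""
        return [
            tuple(re.sub(pattern, "", v) if isinstance(v, str) else v for v in row)
            for row in self.query(sql, params)
        ]


# Demo data: an in-memory table invented for this example.
db = QueryRunner()
db.connection.execute("CREATE TABLE events (name TEXT, value INTEGER)")
db.connection.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("page view", 3), ("sign  up", 1)],
)

rows = db.query_clean("SELECT name, value FROM events ORDER BY value", r"\s+")
print(rows)
```

The point is the same as in the Java version: the connection and the clean-up logic live in one place, so the analysis scripts only call `query` or `query_clean` instead of repeating connection and regex code.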
yorp over 10 years ago
We are building Sclera, an extensible SQL engine that enables you to push your analytics operations into a SQL query. The idea is to tame the code complexity through a declarative interface to analytics libraries. You can add your own libraries using the Sclera Extensions SDK: http://www.scleradb.com/doc/sdk/sdkintro

From the FAQ: http://www.scleradb.com/doc/info/faq#i-am-an-analytics-consultant- why-do-i-need-sclera

> Specifically, Sclera separates the analytics logic from the processing and data access. The analytics logic is specified declaratively as SQL queries with Sclera’s analytics extensions. This is just a few lines of code, which can be changed easily. The analytics libraries, database systems and external data sources form their own modules and are separated from the analytics logic. The analytics queries are compiled by Sclera into optimized workflows that dynamically tie everything together.
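
To make the general idea of pushing analytics into the query concrete without relying on Sclera itself, here is a small sketch using Python's standard-library `sqlite3` module. It is plain SQLite, not Sclera's SQL extensions, and the `sales` table and its values are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0)],
)

# The grouping and averaging are stated declaratively in SQL,
# so no hand-written loops or intermediate dicts are needed in Python.
summary = conn.execute(
    "SELECT region, COUNT(*), AVG(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(summary)
```

Changing the analysis then means editing one query string rather than several processing steps.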
Warewolf-ESB over 10 years ago
Hey Elliott, it might be worth checking out Warewolf ESB. It's a visual programming platform built on flow-based programming principles. It's primarily a service bus, but for your needs it will really help you move away from the "spaghetti code" and into a more modular, visual application. It's open source and free:

Compiled version: http://warewolf.io
Source code from GitHub: https://github.com/Warewolf-ESB/Warewolf-ESB
mc_hammer over 10 years ago
Not a Python dev, but:

- Python probably has a lib like underscore (reduce, map, filter, etc.), which could help.
- Check out the Quake source code, any version. It's huge, and the entire thing is not only readable but possibly a work of art.
- Have you tried lambdas? To some they're more readable, e.g.:

```python
nums = range(2, 50)
for i in range(2, 8):
    # list() runs the filter right away, while i still has this value (needed on Python 3)
    nums = list(filter(lambda x: x == i or x % i, nums))
```

Personally, when I have too complex a process I like to go more functional, e.g.:

```python
def main():
    # each step is its own function, defined elsewhere
    prepare_data1()
    prepare_data2()
    do_long_stuff()
    nextstep()
```

That allows me to focus only on building one step and still have readable code.

Many game devs prefer breaking their project into many tiny files, each with a specific purpose, instead of spaghetti, e.g.:

```
file.py
parser.py
display.py
function1.py
function2.py
```

It's also a bit easier to navigate around the project and make sense of it this way. You might also want to check out Rust or D or F or another lang.
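
A small sketch of two of the suggestions above, using only Python's built-ins and standard library: `map`, `filter`, and `functools.reduce` cover much of what underscore offers, and a `main()` that just calls named steps keeps the top level readable. The step names (`load_rows`, `clean_rows`, `total_amount`) and the sample data are invented for the illustration.

```python
from functools import reduce


def load_rows():
    """Stand-in for reading from a file or database (sample data is made up)."""
    return [" 12 ", "7", "", "30", "n/a"]


def clean_rows(rows):
    """Strip whitespace, drop anything that is not a whole number, convert to int."""
    stripped = map(str.strip, rows)
    numeric = filter(str.isdigit, stripped)
    return list(map(int, numeric))


def total_amount(values):
    """Add the values up with reduce (sum() would also work here)."""
    return reduce(lambda acc, v: acc + v, values, 0)


def main():
    # Each step does one thing, so main() reads like a table of contents.
    rows = load_rows()
    values = clean_rows(rows)
    return total_amount(values)


if __name__ == "__main__":
    print(main())
```

Because every step takes its input as an argument and returns its output, each one can be tried out on its own in a notebook and later moved into its own file without changing `main()`.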
lovelearning over 10 years ago
Has somebody reviewed your code and called it spaghetti, or is it your own opinion?

If it's your own opinion, then it's possible you're being unduly harsh on your own work. Perhaps you can publish it, or a suitable equivalent, on GitHub and ask people here for code reviews.