Ask HN: I have to analyze 100M lines of Java – where do I start?

87 points by user1241320 over 10 years ago

As part of a huge "let's see what's going on here and re-build this from scratch" effort, they dumped the whole code repository on me and my team. We've started parsing it and tried to work on extracting abstract syntax trees and all that. Any idea would help us a great deal. Thanks.

50 comments

sergiosgc over 10 years ago

For rewrite-from-scratch projects, I always start by identifying the use cases covered by the application. You don't need the code for that. Just run the application and identify what it does. Then, work backwards. For each use case, use the existing code as the specification of the use case behavior.

At 100 million lines, I'd suspect this is either an extremely large project, where a rewrite from scratch is inadvisable, or that there is a code generator at work. If it is the latter, you want to analyze the code-generating source, not the end result.

Anyhow, generically, for a first contact with a new code base, code coverage tools are a good start, as is a call graph debug run of the project. It'll let you spot dead code as well as hot code (code being called at every run of the application). It'll highlight the important and unimportant parts of the code, allowing you to read less code and get a grasp of the architecture.
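
A rough sketch of the hot/dead split described above, in Python. It assumes each run of the application has already been reduced to a set of executed method names (for example, exported from a coverage tool). The `classify` helper, the method names, and the run names are all made up for illustration.

```python
from typing import Dict, Iterable, Set


def classify(all_methods: Iterable[str], runs: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Split methods into hot (hit in every run), dead (never hit) and occasional."""
    universe = set(all_methods)
    hit_sets = [set(hits) & universe for hits in runs.values()]
    hit_anywhere = set().union(*hit_sets) if hit_sets else set()
    hot = set.intersection(*hit_sets) if hit_sets else set()
    return {
        "hot": hot,                       # executed in every recorded run
        "dead": universe - hit_anywhere,  # never executed in any recorded run
        "occasional": hit_anywhere - hot,
    }


# Hypothetical data: one entry per use case that was exercised.
methods = ["App.main", "Billing.charge", "Billing.refund", "Legacy.export"]
runs = {
    "create_invoice": {"App.main", "Billing.charge"},
    "refund_invoice": {"App.main", "Billing.refund"},
}
result = classify(methods, runs)
```

Methods in the "dead" bucket are only candidates: they were not reached by the runs that were recorded, which says nothing about runs that were not.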

goshx over 10 years ago

10 years ago I worked on a large project to re-write a code base written in C. Our approach was to forget about the code and document everything it did from the user's perspective. Once everything was mapped out we decided on what we were going to keep, modify or remove, and then started building everything from scratch. You can always go back to the original code to see how a particular feature was implemented and perhaps re-use the same logic.

michaelvkpdx over 10 years ago

With a codebase like that, it's better to look at it through the users' eyes, rather than trying to reverse engineer the business from the code. Things that look like bugs in the code may actually be features for the users, or may have been absorbed so long ago that they've fundamentally changed the nature of the business.

You don't need to understand the whole codebase. It will take years. Best to focus on what the users need and analyze small chunks. If it's truly 100M lines, there's not going to be any semblance of consistency in the code.

You can also slap New Relic on it and you may be amazed at what you learn, right away.

Don't waste too much time trying to understand all the code. Focus on a couple of issues first, form some hypotheses, and then see how well your understanding of the code fits the bigger picture. Refactor and repeat.

logn over 10 years ago

Source to UML: http://www.architexa.com/

Getting call paths: https://github.com/gousiosg/java-callgraph

Line coverage from instrumented jars: http://emma.sourceforge.net/

For this type of request, I'd push back and say, let's identify very small parts of this and begin rewriting those one at a time in an isolated project. Kind of an agile rewrite that will combine the legacy project with the slowly rewritten one. Use the tools to identify parts of the project that can be isolated. Build new interfaces or services to let the old project communicate with the new one. Get a history of the source repository to see where recent edits are and prioritize those to be rewritten first (presuming they want a rewrite to lower maintenance costs).
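
One way the "where are the recent edits" step could look, as a minimal Python sketch. It assumes the repository is in git (the thread does not say which version control system is used) and that the output of `git log --since="1 year ago" --name-only --pretty=format:` has been captured as text. The `churn_by_file` helper and the file paths are made up for illustration.

```python
from collections import Counter
from typing import List, Tuple


def churn_by_file(git_log_output: str) -> List[Tuple[str, int]]:
    """Count how often each path appears in `git log --name-only` output."""
    counts = Counter(
        line.strip() for line in git_log_output.splitlines() if line.strip()
    )
    return counts.most_common()


# With an empty --pretty format, git prints one path per line and a blank
# line between commits. The paths below are invented sample data.
sample = """\
src/billing/Invoice.java
src/billing/Tax.java

src/billing/Invoice.java

src/reports/Export.java
src/billing/Invoice.java
"""
ranking = churn_by_file(sample)
```

The files at the top of `ranking` are the ones touched by the most commits in the chosen window, which is one possible proxy for where maintenance effort goes.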

bjackman over 10 years ago

You haven't really described your goals: What do you want to extract from your analysis? Metrics to tell you what's "wrong" with the existing code base? Some sort of model of the system's semantics?

rashthedude over 10 years ago

What type of application has 100 million LOC? Windows 7 has 40 million lines of code, so I'm wondering what type of application/software it is.

cyrillevincey over 10 years ago

Before rebuilding any piece of software from scratch, I would take a serious look at this amazing bunch of wisdom: http://www.joelonsoftware.com/articles/fog0000000069.html

EtienneK over 10 years ago

1) Focus on the functional use-cases and not code.
2) Identify integration points to other systems and ask why they are there.
3) Realize that a "big-bang" rebuild never works and that it's better to break up the system into smaller pieces and replace them piece by piece.

mml over 10 years ago

Funny, I have 50,000 individual Java apps to analyze. I started with a copy/paste detector. PMD has a free one. Good luck!
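
To give a feel for what a copy/paste detector reports, here is a deliberately naive Python sketch that hashes every window of consecutive non-blank lines and reports windows seen more than once. It is not a description of how PMD's detector works. The `find_duplicates` helper, the window size, and the sample sources are assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List, Tuple


def find_duplicates(files: Dict[str, str], window: int = 3) -> Dict[str, List[Tuple[str, int]]]:
    """Map the hash of each `window`-line chunk to the (file, line) places it occurs."""
    seen: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for name, text in files.items():
        # Normalize: strip indentation, drop blank lines, keep original line numbers.
        lines = [(i, l.strip()) for i, l in enumerate(text.splitlines(), 1) if l.strip()]
        for start in range(len(lines) - window + 1):
            chunk = "\n".join(l for _, l in lines[start:start + window])
            digest = hashlib.sha1(chunk.encode("utf-8")).hexdigest()
            seen[digest].append((name, lines[start][0]))
    # Keep only chunks that occur in more than one place.
    return {h: locs for h, locs in seen.items() if len(locs) > 1}


# Invented sample sources; the same loop appears in both with different indentation.
files = {
    "A.java": "int total = 0;\nfor (int x : xs) {\n  total += x;\n}\nreturn total;\n",
    "B.java": "log();\nint total = 0;\nfor (int x : xs) {\n    total += x;\n}\n",
}
dups = find_duplicates(files)
```

A line-based approach like this misses copies where identifiers were renamed, so a real tool is the better choice for anything beyond a first look.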

sp332 over 10 years ago

As a first pass, try deleting as much code as possible :) If there are files or whole projects that aren't needed anymore, they're just slowing down your analysis. Also some dead-code analysis could be helpful, at least in broad strokes. You could instrument the code with a test coverage tool, then run the code instead of the tests to see what code gets reached.

Edit: You could also look for duplicated code, and quickly refactor that to just be in one place.

radicalbyte over 10 years ago

Do you and your team have experience of Java development?

Your question sounds like something that someone with no real experience and/or no experience of an object-oriented language would ask.

100 million lines is a lot of code. Why do you need to "parse it to extract the AST"? That's crazy.

Do you have the original design documents and architectural documentation? If you do, read them.

xradionut over 10 years ago

Here's a suggestion I haven't seen: Unless you have full management support, a skilled team, valid business reasons for this conversion, and expectations of succeeding, consider moving to another company/job.

You've been given the task of digital archeology/septic cleanup. Unless you like the tedium and stank, it's not going to bode well...

dugmartin over 10 years ago

Understanding the "shape" of a codebase is something I've always been interested in, and I started building a tool to help me understand and traverse code here: http://sherlockcode.com/

However, I don't think it would scale to 100M lines of code. I have run Linux through it and it was acceptable (both in run times and browse times). At 100M lines of code you need some way to see an overall "map" of the codebase and then drill in to the bits you are interested in. Just linking via symbols like SherlockCode does is too micro of a view.

There are a lot of interesting visualization tools out there, both commercial and academic. I don't have any Java-specific ones to recommend, but a quick Google search for "java code visualization tools" shows a lot of promise.

myang over 10 years ago

A very rough estimate: assume you have 10 experienced developers on the team, and each can read and comprehend 1,000 lines of code per hour. Given a 10-hour workday, the team can digest 100,000 lines of code per day. Just to finish reviewing the code will take 1,000 days, about 2 years and 8 months. Not sure how much time you have and when you would be expected to deliver the final product. On the other hand, if you can find out the use cases and even take a look at the current product, you may not have to review the source code at all but can just go ahead and implement the features.
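
The arithmetic behind this estimate, spelled out as a small Python calculation. The inputs are the ones given in the comment; the 250 working days per year used for the second conversion is an added assumption, not something the comment states.

```python
developers = 10
lines_per_hour = 1_000        # per developer, from the comment
hours_per_day = 10
total_lines = 100_000_000

lines_per_day = developers * lines_per_hour * hours_per_day   # team throughput
days = total_lines / lines_per_day                            # review days needed

# Converting days to years depends on what counts as a "day".
years_without_days_off = days / 365    # working every calendar day
years_at_250_workdays = days / 250     # assumption: about 250 working days a year
```

Counting every calendar day gives roughly 2.7 years, in line with the figure quoted above. Counting only working days stretches the same 1,000 days to about 4 years.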

FollowSteph3 over 10 years ago

I don't think you can just do a cold re-write of that size without domain knowledge. I would first try to refactor the existing system just to reduce the code size. That big a system probably has horrific code and you can easily shrink it quickly. Just finding duplicate code will have an impact. So will replacing home-grown code, such as file utilities, with open source libraries.

Basically I would first try to reduce the size of the problem while trying to get domain expertise. I wouldn't consider a rewrite at this stage...

Koziolek over 10 years ago

1. Configure Jenkins builds.
2. Add PMD for code analysis.
3. Add Sonar for code analysis (it has different rules than PMD).
4. Use Archeology 3d (https://github.com/pslusarz/archeology3d) to visualize your code stats.

But before you start, just pray to the Omnissiah (http://warhammer40k.wikia.com/wiki/Machine_God).

jacquesm over 10 years ago

Callgraph. Then document the larger chunks, working your way down.

It's like having a map versus having no map at all.

And 100M lines? Are you sure there is no code generator at work here?
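
A small Python sketch of one thing a call graph is good for: finding what is reachable from the known entry points, and therefore what is not. It assumes the graph is already available as (caller, callee) pairs; the `reachable` helper and the method names are made up for illustration.

```python
from collections import defaultdict, deque
from typing import Dict, Iterable, List, Set, Tuple


def reachable(edges: Iterable[Tuple[str, str]], entry_points: Iterable[str]) -> Set[str]:
    """Breadth-first walk of the call graph starting from the entry points."""
    graph: Dict[str, List[str]] = defaultdict(list)
    for caller, callee in edges:
        graph[caller].append(callee)
    seen = set(entry_points)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for callee in graph.get(node, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


# Invented call edges: Legacy.sync is never called from App.main.
edges = [
    ("App.main", "Orders.load"),
    ("Orders.load", "Db.query"),
    ("Legacy.sync", "Db.query"),
    ("Legacy.sync", "Ftp.push"),
]
live = reachable(edges, ["App.main"])
all_nodes = {name for edge in edges for name in edge}
unreached = all_nodes - live
```

Static call graphs miss calls made through reflection or dependency injection, so anything in `unreached` is a candidate for closer inspection rather than proof of dead code.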

sgt101 over 10 years ago

First thing is to go find the key users (start from the CEO and work down) and find out which of the things it does are important to them. Map that.

Find anyone technical who is still around and can talk sensibly about it and find out what they think is important. Map that.

Use anything automated to map what it's up to (calling...) and find out where the core of it is.

You may know what is important by this time, and you will be able to make some sort of start...

weinzierl over 10 years ago

Large projects tend to accumulate lots of unused cruft. Coverage tools like EMMA/JaCoCo can help, but I have also successfully used UCDetector [1]. It's not bulletproof, but it helped me a lot when analyzing code to remove the unused parts, even if there are false positives sometimes.

[1] http://www.ucdetector.org/

fiatmoney over 10 years ago

The hardest part is often figuring out what the inner loop actually looks like. The best way to find it is to hook up a profiler, and look at a bunch of stack traces. That'll let you find the most common entry points and calling patterns, which will go a long way towards understanding it.

mokeefe over 10 years ago

Read this, then try to convince them not to re-build from scratch: http://www.informit.com/articles/article.aspx?p=1235624&seqNum=3

ufmace over 10 years ago

A few ideas that have worked for me in the past:

Map the control flow. This code/app/whatever is doing something in production right now. What tells it to start? How does the control flow from the start point to the stuff that takes in data to the stuff that writes the output or does whatever this app does? Whatever the options for how it works are, where are they set, and how do they make it into the core of the application to affect whatever it does?

Map the data flow. Input must be coming into this thing somewhere. Find where it reads it in, where it writes it out, and how it gets from one to the other, what data structures and methods it passes through on the way.

atlantic over 10 years ago

1) Determine the use cases of the different applications involved (i.e., what is each one used for, how does it fit into the company's workflow).
2) Treat each app as a black box and understand the major data flows involved (which data sources is it interacting with? is it doing reads or writes? which tables?).
3) Treat each app as a black box, and try to understand the interactions between each app and any external components (other apps, web services, etc.).
4) Identify the overall architecture, determine the class hierarchy for each app, and identify the major classes and functionality.

By this stage you should be ready for a rewrite. At no point do you need to go into the code in any great depth.

mping over 10 years ago

There is no programmatic way of doing this. You need to have guys with domain expertise to help you through. Obviously, the effort to fix/migrate will always be proportional to the time it took to create such a mess.

This is 100MM lines; it will never be easy. I would take some time to create the tooling to do this. Say, create a tool that injects some bytecode to generate a pretty callgraph. Then, I'd run the use cases or functionalities individually, and save the callgraph somewhere. But in the end, you will always need domain expertise to guide you through the logic of it.

aragot over 10 years ago

From the "decentralized web"/Agile spirit: Keep the original app online, separate it into several functional domains, and replace them progressively, month after month. This way, each iteration is a small manageable chunk, functional experts can have a complete understanding of their own scope, and the result is a set of independent scalable webapps with a clearly defined scope... assuming you have webapps.

darrelld over 10 years ago

Start at the main function. See how things get set up and walk through the code from there. Keep notes on the structure and flow of things (if any). There isn't really an easy way to do this unless it had been documented properly before.

AstroGrep is a good Windows-based tool that allows you to search within files, so you could use it to find which files produce a particular piece of on-screen output.

Not sure what you mean by using ASTs though.

PeterisP over 10 years ago

Divide and conquer.

Find a way to split it into something like 10 pieces of 10M LOC each in a way where you can understand and [re-]document the data and control flow between them.

Repeat with further subdivision, as much as you have people.

Then, if you really need to re-build this from scratch, do it per component: first, make automated integration tests for its functionality, and only then attempt to rebuild that part of the system.

joshdance over 10 years ago

I don't know if you want to re-write it. Having to identify all the use cases for a system that large will be horrific, especially since people have built on top of the broken ways. And a large customer will require that it be exactly like it was before, so you have to recreate the old broken way alongside the new shiny way. Who decided it needed to be re-written?

bilalhusain over 10 years ago

Please elaborate. Why is there a need for extracting ASTs? Is it a single Java source file of 100M lines? AFAIK Java has limitations on method size. If the code is already organized into files and methods, try to come up with some sort of UML representation. (I am assuming you are trying to understand the code base, not profiling or doing code analysis.)

markc over 10 years ago

Several years ago I saw an impressive demo of an analysis and refactoring tool for large Java codebases called SonarJ (now Sonargraph) by hello2morrow. There are a few other tools in this category (jdepend, agilej, jarchitect). They can give you dependency graph visualizations to help untangle the spaghetti and grok the higher-level structure.

V-2 over 10 years ago

As for your first step: with a project that size (and thus of substantial age, I assume), there are surely tons and tons of dead code. I'd throw it away first. Slim the thing down. The very process of identifying unused code will already familiarize you roughly with the conceptual "shape" of the application.

ramon over 10 years ago

If you're a UML guy, here's a great option: http://www.altova.com/umodel/uml-reverse-engineering.html

Igglyboo over 10 years ago

What are you actually trying to accomplish? "Analyze" is a broad term.

bettynormal over 10 years ago

With 100 million lines of code I would:

1. Find out what is still used, remove the rest.
2. Split code into stand-alone supportable units (applications, libraries, etc.).
3. Rank units in order of new requirements and what code will need to be changed.
4. Divide code between teams.
5. Get code to build, pass any tests and match the last released versions.
6. Go back to management and get them to let you hire lots of people. A person per million lines would be very low...
7. Learn code in order of need.

stuaxo over 10 years ago

Start with a static analysis tool. It will find lots of small possible bugs, and by fixing them you will get good coverage of the code and insight into the structure. The code will also be better at the end.

andrewljohnson over 10 years ago

The code length has to be overstated, by including libraries, generated files, or data files. Is any real code that long?

I bet the core Java code the team actually wrote is 2 orders of magnitude smaller.

jhawk28 over 10 years ago

Look into Structure101 (http://structure101.com/) to use static analysis to see the structure of the application.

bosky101 over 10 years ago

if you can run the program, i highly recommend searching HN for strace, e.g. "whats that program actually doing. start with strace": https://blogs.oracle.com/ksplice/entry/strace_the_sysadmin_s_microscope

i've successfully reverse engineered messaging protocols, written drivers for a different language, and ported large projects just by trying to see what it does over the file system and network.

~B

mattgibson over 10 years ago

Just curious: what application needs 100M lines of anything?

ramon over 10 years ago

On Eclipse: http://www.nwiresoftware.com/products/nwire-java

jebblue over 10 years ago

jvisualvm comes with the JDK, I'd start there with profiling: http://visualvm.java.net/profiler.html

Edit: Adding, I'd set some judicious breakpoints in the hot spot areas identified through profiling, along with some System.out.println's (or better, dump to a flat-file database; SQL can work wonders for analysis even on flat-file data).
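
A minimal sketch of the "dump it somewhere and query it with SQL" idea, using Python's built-in `sqlite3` module and an in-memory database. The record layout (method name plus elapsed milliseconds), the table name, and the sample values are all assumptions made for illustration; real trace output would need its own parsing step.

```python
import sqlite3

# Hypothetical trace records: (method, elapsed_ms), e.g. parsed from log lines.
records = [
    ("Orders.load", 120),
    ("Db.query", 95),
    ("Orders.load", 140),
    ("Report.render", 30),
    ("Db.query", 105),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (method TEXT, elapsed_ms INTEGER)")
conn.executemany("INSERT INTO calls VALUES (?, ?)", records)

# Rank methods by total time spent, with call counts alongside.
top = conn.execute(
    "SELECT method, COUNT(*) AS n, SUM(elapsed_ms) AS total_ms "
    "FROM calls GROUP BY method ORDER BY total_ms DESC"
).fetchall()
conn.close()
```

Once the records are in a table, follow-up questions (slowest single call, calls per hour, and so on) become one more query rather than one more script.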

mschaef over 10 years ago

You need to have a clear understanding of the point of the analysis before you analyze anything. What, specifically, does your team have to produce? How much time and how many people do you have to complete the work? If you're the leading edge of an effort to rewrite 100 MLoC, my presumption is that your deliverable is mainly a 'gross anatomy' of the system... a basic description of the major structural components and how they interact with each other. If that's the case, I'd start by looking at the build scripts and the modules they build. Try to make a comprehensive list of major components. You'll get it wrong initially, but you'll need a starting point.

The next thing I'd do is take the top-level list of modules and start assigning them to individual people within the team. Their responsibility is to produce some kind of top-level description of how the individual modules work. A big part of this phase of the effort should be meetings or informal conversations as the per-module analysis progresses. As your team talks among itself, you should be able to find commonality between modules, communication links, etc. The key at this point is to keep it high level, and avoid getting too bogged down in the details. With this much code, there are plenty of details to get bogged down in. As a result, you'll probably have some mysteries about how the code actually works beneath various abstraction layers. Make and update a list of these 'mysteries' and keep it next to your team's list of modules. As you work through the list of modules, some of these will solve themselves, and some will be so obviously important that it's worth a detailed deep dive to really understand what's happening. Either way, there will be times that you have no idea what's going on in the codebase and you'll just have to trust that you'll figure it out later.

One final comment I'd like to make is that, as silly as SLoC is as a measure of the size of a software system, you're looking at a large software package. (Bigger than Windows, Facebook, Linux, OSX, etc.) If you take each line of code to have cost $5-10, then the system arguably cost $1B to build in the first place.

Because of the size of the system, you shouldn't expect your analysis work to be easy, fast, or cheap. Buy the tools you need to do the work. This means technical and domain training, software, hardware, process development, new staff... basically whatever you need to make the work happen. You're at the point where long-term investments are highly likely to pay off, because your scope is so large and your timeline is entirely in front of you.

I'd also highly recommend working this problem from two angles. You can understand the existing system by looking at the code, but you also need to clearly understand the system requirements from the 'business' point of view. If you're doing bottom-up analysis, then some other group needs to be doing top-down. Along those lines, you should also start thinking about deployment strategies. I highly recommend avoiding a big-bang deployment of a system that large, so there will be some period of time when you're liable to be running both the 'old world' and the 'new world' systems at the same time. Think about how you want to do that...

There is lots to think about here, because this is a complex problem. Hopefully, I've given you at least a little bit to think about. Good luck.

andrewchambers over 10 years ago

Rebuild 100M lines from scratch? Sounds impossible to me.

Maybe you can re-implement the applications that are causing problems, one at a time.

I don't understand what you want to get syntax trees for, but it sounds like you are gonna need to store them in a database and run queries on it if there really is info that you need.

dmead over 10 years ago

why would you need the ast?

if given a large chunk of code to maintain i'll usually run doxygen on it to generate the xmlish kind of chart that it makes. at least it gives me a roadmap to start, but it's not super great.

mbrodersen over 10 years ago

Read "Working Effectively with Legacy Code".

sgt101 over 10 years ago

Also - what is the history of getting into this mess?

DonPellegrino over 10 years ago

check out https://github.com/facebook/pfff

clavalle over 10 years ago

I would build a general profile of the application and then drill in as needed rather than try to grok the whole buffet at once.

If the idea is to rebuild the application I would start at the beginning: what is the input and output? What does the user see? What are the various service hooks? How are they called? When are they called? Why are they called?

Then I would look at how the overall code is organized. What modules are there? Are there core utility modules that seem to be called by everything else? What are those doing? What are the most used business function modules?

Then I would look at the build process. What external dependencies are there? What are they used for? Are there modern alternatives? What about internal dependencies? Does the build process look organized and sane or a chaotic mess cobbled together over the years?

Do you have logs? What is the most utilized part of the application?

Then I would look at the database. What tables seem to be the most important (if you could get usage stats from a running and used application that could help, but otherwise you could look at which tables are keyed off of the most)? What data is most critical? What modules interact with that data? What tables are essential for supporting this data?

Answering these questions will start to fill out a nice 30,000 ft view of the application and how it is actually used.

You are going to get the most bang for your re-implementation buck by identifying and replacing often-used utilities (especially if they are custom built or built before a good de facto standard was formed for that particular task) with modern, well-known alternatives. Then follow the execution path of the most often used modules and the modules that work with the most critical data and work down the list.

With a 100 million line application, you are looking at many years to understand all of it and many years to re-implement. To get anything useful in a reasonable amount of time you are going to have to boil it down as much as possible, then break what's left down into independent functional areas and tackle it an area at a time.

The code is important, but if it were me, I'd try to analyze how the users and processes work before I'd dig into the nitty gritty of the code too much, if at all possible. I'd build the smallest functional unit from what I deem to be the most important and critical module(s), trying to cut as much cruft from the application and database as possible. I'd get users and processes to start banging on the new app as soon as possible. I'd keep the old application up and running and available to analyze (not for the users but for the developers and analysts) as the team works down the most often used parts. I would not try to analyze the whole mess in one go beyond finding waypoints as described above. If possible I'd also try to get users to understand that the old way is not necessarily the right way. Much pain has been caused trying to make new systems work exactly like the old systems when the new systems don't face the same constraints. It is just too tempting to say 'make it work like it did'.

mathusan over 10 years ago

box

slermukka over 10 years ago

You could try to reduce the lines of code by removing all the useless whitespace.
