Ask HN: How do you document and keep tabs on your infrastructure as a sysadmin?

156 points by redsec about 7 years ago

I am wondering how experienced sysadmins document and manage their infrastructure.

35 comments

vinceguidry about 7 years ago

When I was working as a sysadmin, I kept a spreadsheet. I was told later of a repository of information that supposedly was what my spreadsheet did, but it didn't add anything new and was much harder to keep up.

I built it up using nmap and then shelling into each individual machine and poking around to see what it did. This was back in the days before everything became virtualized, so each machine on the network was likely physical.

I added information by walking the aisles and copying down the rack location of every machine into another page on the spreadsheet. I eventually hooked up a terminal to them all and matched network addresses to physical machines.

Only took a few weeks and when I was done, I knew things about the network that guys who worked at the business for years didn't know.

There's no substitute for the good old-fashioned way.

I liked that job, it was fun.
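
The nmap-to-spreadsheet step described above could be sketched roughly as follows. This is only an illustration, assuming the scan was saved in nmap's grepable format (`nmap -oG`); the sample addresses, hostnames, abbreviated header line, and the empty `rack_location` column (to be filled in by hand while walking the aisles) are all made up for the example.

```python
import csv
import io
import re

# Abbreviated sample of nmap grepable output (`nmap -oG -`).
# Addresses and hostnames are invented for illustration.
SAMPLE_SCAN = """\
# Nmap scan initiated as: nmap -oG - 192.0.2.0/24
Host: 192.0.2.10 (web01.example.com)\tStatus: Up
Host: 192.0.2.10 (web01.example.com)\tPorts: 22/open/tcp//ssh///, 80/open/tcp//http///
Host: 192.0.2.11 ()\tStatus: Up
Host: 192.0.2.11 ()\tPorts: 22/open/tcp//ssh///, 3306/closed/tcp//mysql///
"""

# Matches only the "Ports:" lines; "Status:" lines and comments are skipped.
HOST_LINE = re.compile(r"^Host: (\S+) \((.*?)\)\s+Ports: (.*)$")


def parse_grepable(text):
    """Return one inventory row per scanned host that has a Ports line."""
    rows = []
    for line in text.splitlines():
        match = HOST_LINE.match(line)
        if not match:
            continue
        address, hostname, ports_field = match.groups()
        # Drop any trailing tab-separated fields such as "Ignored State".
        ports_field = ports_field.split("\t")[0]
        open_ports = []
        for entry in ports_field.split(","):
            # Each entry looks like: port/state/protocol/owner/service/...
            fields = entry.strip().split("/")
            if len(fields) >= 5 and fields[1] == "open":
                open_ports.append(f"{fields[0]}/{fields[2]} ({fields[4]})")
        rows.append({
            "address": address,
            "hostname": hostname,
            "open_ports": "; ".join(open_ports),
            "rack_location": "",  # filled in manually later
        })
    return rows


def to_csv(rows):
    """Render the rows as CSV text that any spreadsheet can import."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["address", "hostname", "open_ports", "rack_location"],
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(parse_grepable(SAMPLE_SCAN)))
```

The shelling-in and rack-walking parts still have to be done by a person; the script only covers the initial discovery pass.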

cik about 7 years ago

We use Collins (https://tumblr.github.io/collins/) as a Configuration Management Database, Ansible (https://www.ansible.com/) for automation, Terraform (https://www.terraform.io/) + a bunch of homebrew for orchestration, and Packer (https://www.packer.io/) for multi-cloud (and hypervisor) image creation and maintenance, powered by Ansible. Every single thing is committed to a series of Bitbucket (https://www.bitbucket.org) repositories.

We connect Ansible and Collins through ansible-cmdb (https://github.com/fboender/ansible-cmdb), then tie the entire thing to our ticketing systems, ServiceNow (https://www.servicenow.com/) and Jira Service Desk (https://www.atlassian.com/software/jira/service-desk), and finally ensure we have history tracking with Slack (https://www.slack.com).

As a given, we yank test the entire world. If it doesn't pass a yank, it straight up doesn't exist.

Whether it's bare-metal, virtualized, para-virtualized, dockerized, mixed-mode, or cloud - we 100% do this all the time. There is not a single change across any environment that isn't fully tracked, fully reproducible, fully auditable, and fully automated.

majewsky about 7 years ago

- keep inventory in a DCIM (we use Netbox)
- configure everything as code (we use Ansible for the infrastructure up to OS level, Kubernetes w/ Helm for applications), have it read the values from the DCIM so that the DCIM remains the single source of truth (we still need to get better on this part...)

Links: https://github.com/digitalocean/netbox https://www.ansible.com https://www.kubernetes.io

That's at work. At home, I do much of the same, except that maintaining a DCIM is excessive for 2 VPS and a home network of 3 boxes.
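
The idea of having the configuration tooling read its values from the DCIM might look something like the sketch below, which turns a list of device records into the JSON structure Ansible expects from a dynamic inventory script (groups with a `hosts` list, plus `_meta.hostvars`). The device records and their field names (`name`, `role`, `site`, `primary_ip`) are assumptions made up for illustration, not NetBox's actual schema; in practice they would come from the DCIM's API or an export.

```python
import json

# Hypothetical device records standing in for data pulled from a DCIM.
# The field names are illustrative assumptions, not a real DCIM schema.
DEVICES = [
    {"name": "web01", "role": "webserver", "site": "dc1", "primary_ip": "192.0.2.10"},
    {"name": "web02", "role": "webserver", "site": "dc2", "primary_ip": "192.0.2.11"},
    {"name": "db01", "role": "database", "site": "dc1", "primary_ip": "192.0.2.20"},
]


def build_inventory(devices):
    """Group hosts by role and by site in Ansible's dynamic inventory layout."""
    inventory = {"_meta": {"hostvars": {}}}
    for device in devices:
        for group in (device["role"], f"site_{device['site']}"):
            inventory.setdefault(group, {"hosts": []})["hosts"].append(device["name"])
        # Per-host variables; ansible_host tells Ansible which address to use.
        inventory["_meta"]["hostvars"][device["name"]] = {
            "ansible_host": device["primary_ip"],
            "site": device["site"],
        }
    return inventory


if __name__ == "__main__":
    # A dynamic inventory script prints this JSON when called with --list.
    print(json.dumps(build_inventory(DEVICES), indent=2, sort_keys=True))
```

Because the inventory is derived on every run, a host that is renamed or moved in the DCIM shows up in Ansible without anyone editing a second file.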

zie about 7 years ago

I'm going to mostly disagree with everyone here, much to my karma's detriment ;P

I agree the end-goal should be infrastructure as code, and everyone here has covered those tools well. You also want monitoring across your infrastructure. Prometheus is the new poster-boy here, but the Nagios family, and many other decent OSS solutions exist as well.

But you still need documentation. Your documentation should exist wherever you spend most of your time. Some examples:

* If you spend most of your time on a Windows Desktop, doing windows admin type things, then OneNote or some other GUI note-taking/document program makes sense.
* If you spend most of your time in Unix land (linux, BSD, etc) then plain text files on some shared disk somewhere for everyone to get to, makes WAY more sense. Bonus if you put these files in a VCS, and treat it like code, and super bonus if your documentation is just a part of your Infra as code repositories.
* If you spend your time in a web browser, then use a Wiki, like MediaWiki, wikiwiki, etc.

In other words, put your documentation tools right alongside your normal workflow, so you have a decent chance of actually using it, keeping it up to date, and having others on your team(s) also use it.

We put our docs in the repos right alongside the code that manages the infrastructure, in plain text. It's versioned. We don't publish it anywhere, it's just in the repo, but then we spend most of our time in editors messing in that repo.

antoncohen about 7 years ago

It might be helpful if you described your infrastructure. There is a pretty big difference between managing physical Windows servers in a data center and managing Linux servers all in AWS.

If you are all or mostly cloud, Terraform + config management with a CI pipeline takes care of a lot. Then a wiki that covers "Getting Started" and a few how-to articles.

For physical infra you need the setup for DHCP, updating DNS based on DHCP, PXE boot imaging, IPMI access and configuration, switch and router configuration, what servers are connected to which switch ports, PDU management and monitoring, and on and on and on.

You end up with something like NetBox (https://github.com/digitalocean/netbox) or Collins (https://tumblr.github.io/collins/), plus a bunch of other stuff gluing things together.

beh9540 about 7 years ago

I think it depends a lot on the size of your infrastructure. I've used Excel docs on a shared drive pretty successfully where there's not much to keep up on and changes are few.

In larger infrastructure setups (small service provider) we used a combination of netboot, SNMP for monitoring with Observium and Nagios for alerting. We were also a big VMware environment, so naturally we had a lot of inventory tracking available through vCenter as well. I found a lot of opposition to Configuration Management, given the lack of comfort with programming of some sysadmins (Windows admins), so that's something to keep in mind as well. I think mixed environments also can be challenging w/ infrastructure as code, but I'd be interested to see how others get through that.

seorphates about 7 years ago

The past decade has been interesting and I'm still processing it.

My current thoughts are that an appropriate approach is for your systems to document themselves via the applications that they run - inside out.

Though I must abide, I cannot fully subscribe to "infrastructure as code" anymore. It has proven just another shift, primarily in toolsets and who (or what) gets say and sway over the capacity, capabilities and efficiencies of the thing you actually care about - the app stack and all of its assembled functionality.

In other words most approaches are still "outside in" - one defines 'x' for deploy fitments and that typically over and over and over again and, typically, with a rigidity that can too easily override and overrule, effectively caging your application in scale and scope. With my current tack I am trying to provide for 'y' to "self identify" (via some/any form of config mgmt) where from here you can begin to effectively "deploy to any" by hooking the "application config as code" that, in turn, defines its infrastructure and deploys "outward". The "infrastructure as code" then becomes the servant with its objects and platform definitions etc. and the "appconfig as code" becomes the master where the latter defines its own scope and scale.

Infrastructures have a funny way of mutating into inefficient "definitions" of something that once made sense, on the first day, and forevermore complicating progress with capacity, rules and opinions.

But, generically, snmp is still pretty cool for telling me what I need to know. Strap that into any end engine and, boom, ask any question, request any inventory.

So.. I track apps, not systems. Systems are expendable, applications are not.

brudgers about 7 years ago

I don't do devOps but if I did...

http://howardism.org/Technical/Emacs/literate-devops.html

https://www.youtube.com/watch?v=dljNabciEGg

itomato about 7 years ago

There are several classes of "infrastructure" as a sysadmin: legacy, new and critical.

Legacy stuff is done the old fashioned way - portscans and nmap. If it has an open port, it's presumed to be intentional. If not, it's a target. I've seen some success using tools like Pysa to "blueprint" existing systems into Puppet code. Tools like SystemImager help here, too - enabling P2V and the creation of "file-based images" compatible with version control and able to PXE boot new clones.

New stuff is from-scratch IaC all the way to the metal. Ansible and git submodules help me build "sandwiches".

Critical stuff blurs the lines. The machines, IP addresses, ports and living connectivity can be documented, and "captured" to a limited extent with the manual mapping and Rsync stuff in the Legacy category. Some of this critical stuff is also "new", and is deployed in that fashion.

What about switchgear and Cisco configs? License strings, key management, site-specific patching - all can complicate things.

More important than any of these is the ability for you and those around you to see and manage the systems as they are launched and terminated.

In the old days, I used to use a shell script on a newly-provisioned host to dump all its details - dmidecode, environment stuff and so on. Those details were pushed back to a common source and were a real benefit in the days before real config management came on the scene. CFEngine was way too complicated and nebulous at the time.
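
A rough modern equivalent of the "dump all the details of a new host" script mentioned above is sketched here, using only the Python standard library. It is an assumption-laden illustration rather than the original script: it does not call dmidecode (which needs root), the set of collected fields is arbitrary, and pushing the result to a common source is left out.

```python
import json
import os
import platform
import socket

# Environment variables worth recording; this short list is an arbitrary
# choice for the example.
INTERESTING_ENV = ("PATH", "SHELL", "LANG")


def collect_host_details():
    """Gather a small, JSON-serializable description of the current host."""
    return {
        "hostname": socket.gethostname(),
        "os": platform.system(),
        "os_release": platform.release(),
        "architecture": platform.machine(),
        "python_version": platform.python_version(),
        "cpu_count": os.cpu_count(),
        "environment": {
            key: os.environ[key] for key in INTERESTING_ENV if key in os.environ
        },
    }


if __name__ == "__main__":
    # In a real setup this JSON would be sent to a shared location
    # (a file share, a repository, or an inventory service).
    print(json.dumps(collect_host_details(), indent=2, sort_keys=True))
```

Running it at provisioning time gives each host a small self-description that can be diffed later.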

falcolas about 7 years ago

For me/us, it's a combination of infrastructure-as-code and metrics reporting/logs. Most of our boxes are swapped out on a weekly or more frequent basis, so the only accurate picture of what's running right this moment is the graphs built by the metrics collection tools. The only accurate picture of what's running on those boxes is the code which built the infrastructure.

There are a couple of exceptions, but those are actively being brought under the above model (mostly because they are effectively invisible, and the existing documentation for them is... incomplete).

Any documentation outside of that is stale in a few hours, and obsolete in a week.

jcadam about 7 years ago

Back when I was put in charge of IT Lifecycle management for my Army unit (not by choice - "Hey, you've got a CS degree, so anything tech related goes to you"), I kept it all in an Access Database, and ran off a report occasionally to update my smartbook (3-ring binder full of stuff that my boss would frequently ask about during meetings). Granted this was back in the early 00's.

owaislone about 7 years ago

Terraform + Datadog + Cloudwatch

http://terraform.io/ http://datadoghq.com/ https://aws.amazon.com/cloudwatch/

atsaloli about 7 years ago

As a professional sysadmin, my go-to reference on this is "Documentation Writing for System Administrators", from the Short Topics in System Administration series.

https://www.usenix.org/short-topics/documentation-writing-system-administrators

Also, this talk was very good:

https://www.usenix.org/legacy/event/lisa08/tech/gelb_talk.pdf

allsunny about 7 years ago

I've used https://www.racktables.org with pretty good luck. It's PHP, which wouldn't be my first choice, but I've largely been able to make it do what I want.

If you want something more clever, say keeping track of asset values etc, you'll want a CMDB. Google around and you should find something that fits your needs. We used ServiceNow in a previous life.

paydro about 7 years ago

We put everything in code. We have several layers, but if you're new you can start with the lowest level and make your way up to find out how things are provisioned and configured.

We're on AWS so we use cloudformation for provisioning and saltstack (https://saltstack.com/) for configuration management. Cloudformation templates are written using stacker (http://stacker.readthedocs.io/en/stable/). All AWS resources are built by running "stacker build" so nothing is done by hand. We have legacy resources that we're slowly moving over to Cloudformation, but more than 90% of our infrastructure is in code.

On top of cloudformation and salt we built jenkins (CI and docker image creation), spinnaker (deployment pipeline), and kubernetes (deployment target). The jenkins and spinnaker pipelines are also codified in their own respective git repos.

All the repos here have sphinx set up for documentation purposes and the repos tend to crosslink for references.

rbjorklin about 7 years ago

I’ve found Zabbix works decently well and also covers monitoring. Zabbix Maps can be nice to visualize the infrastructure: https://www.zabbix.com/documentation/3.4/manual/config/visualisation/maps/map

bradknowles about 7 years ago

So, one problem I’ve seen with most infrastructure as code solutions and CMDBs is that while they do a good job at the tactical level (more or less), and help you answer “how”, “where”, “what”, and maybe “when” questions (depending on how well they support orchestration), they typically do a bad job at the higher level strategic “why” questions.

So, why do you structure your lambda jobs accessing CloudWatch Logs that way as opposed to the other way? If you didn’t know that one way works and the other doesn’t, you wouldn’t be able to understand that question. And that might have domino effects on other parts of your system.

I haven’t found a good solution to documenting the high level strategic “why” questions, other than to just write down the questions and the answer, with reasoning, in some form of associated documentation - maybe in a wiki or something. But, of course, the underlying issues may change in the near future and invalidate the reasons for your decision. And the high level documentation doesn’t have any way to be compiled directly into the lower level implementation, so of course there is always the risk of drift.

I’m still looking for good solutions in this space.

tyingq about 7 years ago

VMware's tagging support is a lighter, more realistic option vs a "CMDB".

Come up with a key/value strategy that covers your need to track things like app name, app category, environment (test, dev, load testing, prod, prod/dmz, etc), and it becomes actually usable and up to date versus an always out-of-date CMDB. And it's compatible with cloud resource tagging.

Sometimes, less is more.
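
A key/value tagging strategy like the one described above is easier to keep honest if something checks it. The sketch below is a minimal validator under stated assumptions: the tag keys (`app_name`, `app_category`, `environment`) and the allowed environment values are hypothetical names loosely based on the examples in the comment, and fetching the tags from vCenter or a cloud API is out of scope.

```python
# Required tag keys. A value of None means any non-empty string is accepted;
# a set restricts the tag to those values. All names here are illustrative.
REQUIRED_TAGS = {
    "app_name": None,
    "app_category": None,
    "environment": {"test", "dev", "load-testing", "prod", "prod-dmz"},
}


def validate_tags(tags):
    """Return a list of problems found in one resource's tag dictionary."""
    problems = []
    for key, allowed in REQUIRED_TAGS.items():
        value = tags.get(key)
        if not value:
            problems.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            problems.append(f"invalid value for {key}: {value!r}")
    return problems


if __name__ == "__main__":
    vm_tags = {"app_name": "billing", "environment": "production"}
    for problem in validate_tags(vm_tags):
        print(problem)
```

The same check can run against on-prem and cloud resources as long as both expose their tags as plain key/value pairs.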

rootsudo about 7 years ago

I use OneNote. But I also use the O365 suite.

MediaWiki is also good, but it can be a bore to run another service for that.

But in the end a text file via notepad/nano is all you need, really.

outworlder about 7 years ago

Spinning up new infra: Jenkins crafts Terraform tfvars based on user input, runs plan, asks for confirmation, applies. Terraform state and vars saved to S3. Chef and Ansible for provisioning.

"Documentation", in terms of where stuff is deployed and what is deployed, is not really necessary. We save this data to a DynamoDB table, query-able by AWS Lambda functions, so other automation can pick it up and devops can query data.

Documentation on how things work comes from dev teams; on how things are deployed, it indeed comes from us, just simple wiki pages.

Services running in Kubernetes, K8s worker instances in auto-scaling groups. If one node dies it is killed and brought up, K8s will reschedule the pods. Same for the pods themselves.

Monitoring through Nagios (getting phased out finally), NewRelic and Prometheus. Basic ELK stack for centralized logs.

Thinking about rolling out Vault for credential management. Chatops on the pipeline (getting pieces in place first, like the db mentioned earlier).

I'm trying to get the company on board on immutable infrastructure, but it is proving difficult.
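
The "crafts Terraform tfvars based on user input" step above could be approximated as below. This is a sketch, not the commenter's pipeline: the variable names (`environment`, `instance_count`, `instance_type`), the allowed environments and the default instance type are invented for illustration. It relies on Terraform accepting variable files in JSON form (for example via `-var-file=generated.tfvars.json`), so no HCL templating is needed.

```python
import json

# Hypothetical set of environments a user is allowed to request.
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}


def build_tfvars(user_input):
    """Validate raw form input and return a dict of Terraform variables."""
    environment = user_input["environment"]
    if environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {environment!r}")
    # Form fields usually arrive as strings, so convert explicitly.
    instance_count = int(user_input["instance_count"])
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    return {
        "environment": environment,
        "instance_count": instance_count,
        "instance_type": user_input.get("instance_type", "t2.micro"),
    }


def render_tfvars_json(variables):
    """Serialize the variables as the contents of a .tfvars.json file."""
    return json.dumps(variables, indent=2, sort_keys=True) + "\n"


if __name__ == "__main__":
    variables = build_tfvars({"environment": "dev", "instance_count": "3"})
    print(render_tfvars_json(variables), end="")
```

Validating before writing the file means a bad request fails in the CI job rather than halfway through a plan.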

FatalBaboon about 7 years ago

Like many here, I keep it described in ansible and documentation inside a git repository.

But I feel like it's lacking. After a while you have so many ansible playbooks and roles that they cannot give you a bird's-eye view anymore.

I think I would MUCH prefer to have some sort of HTML representation, where adding an instance/service starts by adding to that representation, and you could click on every link or node to show its golden image setup, ansible configuration, etc.

THAT, I could show to a newcomer and he'd get it.

richardknop about 7 years ago

By having your infrastructure defined in version control using some sort of domain specific language. For example, by using Terraform and only ever making changes to your infrastructure via Terraform (manual adding/editing of stuff in AWS/GCP console should be disabled so people can't do that). Then all changes to the infrastructure are clearly documented in version control with pull requests.

tmikaeld about 7 years ago

I use:

- https://www.bookstackapp.com/ for portable (Markdown), searchable (SQL), manageable (Users) documentation.
- Ansible for automation and deployment.
- Prometheus for monitoring all the Proxmox nodes and containers.

tyingq about 7 years ago

Aligning a VMware tagging strategy with a cloud tagging strategy is one of my current goals. Things like a full-blown CMDB seem to always end in pain, lag, and orphaned records. I'm happy enough with something basic that spans on-prem + cloud.

tootie about 7 years ago

Can I piggyback and ask how people keep track of deployed software? Like if I have 50 products deployed some of which haven't been touched in 10 years and I want to be able to ramp up a developer to fix a bug on any of them?

evangineer about 7 years ago

GLPI with FusionInventory for IT Asset Management and Knowledge Base.

GitLab for repositories, ad hoc documentation via gists and CI/CD.

Nagios for monitoring.

Open to trying other things out if they make sense.

peterwwillis about 7 years ago

Asset management systems and network inventory databases.

skyisblue about 7 years ago

Those using AWS ALB, how do you monitor your traffic in realtime? I want to aggregate host names, ip addresses, user agents in realtime.

thrownaway954 about 7 years ago

Lansweeper (https://www.lansweeper.com/)

cat199 about 7 years ago

Anyone have any pointers for simple, API-driven management of DNS/DHCP? (like, I don't want to have to configure 1000 moving parts)

Typically this seems to fall into the 'roll your own' or 'giant lumbering enterprise behemoth' category that does 10 other things. I'm looking for the sweet spot.

HeadlessChild about 7 years ago

A configuration manager, Ansible for example. You basically describe your infrastructure with it.

nunez about 7 years ago

I deploy it with code. For hardware stuff, a CMDB, also maintained with code

dxhdr about 7 years ago

I'm curious how ChatOps practitioners handle this.

hypnagogicjerk about 7 years ago

What about securely storing credentials and passwords?

AdamGibbins about 7 years ago

I'm not sure I understand your question fully? You write documentation, like you do anything. And configure everything with code, so you can go read it (Terraform, Chef/Puppet/Ansible, etc).
