I am a GNOME sysadmin... and what have I done?

As some of you might know, I have been a system administrator (sysadmin) for the GNOME project for a few months now (Mango says since the 14th of June 2013).

I thought it might be nice to give a short overview of what I have done since then.

How did I join?

So let me start at the beginning: what got me to join?
It all started on the 13th of June, when Andrea Veri (known as av on GIMPNet or averi on Freenode) pinged me: he wanted to set up a copy of Fedora Status for GNOME.

During this conversation, I asked him if he could use any new help on his team, since he said he was the only active sysadmin at the time.

He told me that normally he would be very careful with that, but that he trusted me because of all my work for the Fedora Infrastructure team (I wrote the new OpenID backend, which will soon get updates with some long-awaited features, maintain our wiki, and more).

He then sent an email to the mailing list asking if anyone had any objections; most people with sysadmin permissions said they trusted his judgement, and so I was accepted.
Over the next few days, we set up my account, with the permissions and everything.

First actions

So after joining, the first thing I did was to check out some of the servers, read the Server list, and get accustomed to everything.

One of the first things I actually did was to set up a reverse proxy for a lot of the GNOME services.

This change got some people angry, but it was very useful for our internal setup: we could move the certificates and their private keys from a lot of different servers to just two (proxy01 and proxy02), and we are now able to quickly switch which server hosts a specific service without modifying DNS or anything else public-facing.

I also helped with some routine maintenance, like service restarts.

Getting access beyond SSH

During one of the mass reboots (where we reboot all servers so they pick up new kernel updates and the like), one server took a very long time to come back up.

This made us realize it would be nice to get access to at least power management for "our" servers (the servers that Red Hat hosts for GNOME), so we could reboot them ourselves whenever something bad happened.

At this point, Stephen Smoogen got me access to the KVM switch of that rack, so we were able to at least see what the servers were doing during reboots (thank you Stephen!).

After some time, we also found out that all of the servers Red Hat hosts for us contain some form of remote management card (either Dell DRAC, IBM RSA2 or Cisco CIMC), which we got hooked up and configured.

So after this, we could both see the remote consoles through the KVM, and reboot the servers through these cards, which is extremely handy.
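
To give an idea of what that looks like in practice, here is a minimal sketch, assuming the management cards expose IPMI over LAN. The host name, user name and IPMI_PASSWORD environment variable are made up for the example; this is an illustration, not our actual configuration.

```python
#!/usr/bin/env python3
"""Minimal sketch of remotely checking or cycling power on a server
through its management card, assuming the card speaks IPMI over LAN.
The host name, user name and IPMI_PASSWORD variable are made-up
examples, not our actual configuration."""

import os
import subprocess


def ipmi(mgmt_host, *command):
    """Run one ipmitool command against a management card and return its output."""
    cmd = [
        "ipmitool", "-I", "lanplus",
        "-H", mgmt_host,
        "-U", "admin",                      # hypothetical user name
        "-P", os.environ["IPMI_PASSWORD"],  # password kept out of the script
    ] + list(command)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout.strip()


if __name__ == "__main__":
    # Prints something like "Chassis Power is on".
    print(ipmi("mgmt-server01.example.org", "chassis", "power", "status"))
    # Uncomment to actually power cycle the machine:
    # ipmi("mgmt-server01.example.org", "chassis", "power", "cycle")
```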

Of course, during the next mass reboot the only server which did not have its management card configured did not come back up, but this was fixed pretty fast after filing a ticket with the Red Hat service desk. We can now safely reboot all servers, knowing we can fix just about anything that might happen to the machines by using the management cards.

Getting serious about uptime

Let me skip over some of our projects to one of the latest things: getting more serious about uptime and about contact with our "users" (which mostly means the people developing GNOME and its subprojects).

We have had a Nagios instance for quite some time now, but until recently an issue would only get a response if one of us happened to see either the email or the notification on IRC.

Also, if anyone else noticed an issue that we had not, chances were pretty big that they were just wasting their time reporting it: if we did not see the Nagios ping on IRC, neither of us was on IRC, and so we would not notice the pings from people trying to tell us something was broken either.

Sometimes people would find Andrea's email address, but then the report would only go to him, and so it would not be acted upon if Andrea happened to be busy.

Another issue was that a lot of sysadmin projects were not registered anywhere: they existed only in the mind of the person who happened to be available at the time someone requested them.

As such, people would have to wait for that person to return before their projects (like the recent migration of the Yorba projects to the GNOME infrastructure) could continue.

So we wanted to fix these issues, and improve our service to our users.

What we decided was to first register all "long-lasting" sysadmin projects in our Request Tracker instance, so that either of us can continue where the other left off.

Next, we set up a service that helps us manage an on-call schedule, to solve the problem of not knowing who should respond to a reported issue: whoever gets paged is responsible for getting the issue fixed. Asking the other for help is fine, but the person who was paged remains responsible.

To make sure we get notified of any issues, we set up Nagios to forward everything it reports to this service.
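
To give an idea of what such forwarding can look like: a Nagios notification command can simply be a small script that turns the usual Nagios macros into an HTTP call to the paging service. The sketch below assumes a PagerDuty-style events endpoint; the URL, routing key and severity mapping are placeholders for illustration, not our actual setup.

```python
#!/usr/bin/env python3
"""Minimal sketch of a Nagios notification handler that forwards an
alert to an on-call paging service.  The endpoint URL and routing key
are placeholders (in the style of a PagerDuty-like events API), not
our actual configuration."""

import json
import sys
import urllib.request

PAGING_URL = "https://events.pager.example.org/v2/enqueue"  # placeholder
ROUTING_KEY = "REPLACE_WITH_SERVICE_KEY"                    # placeholder


def page(host, service, state, output):
    """Send a single alert to the paging service."""
    event = {
        "routing_key": ROUTING_KEY,
        "event_action": "trigger",
        "payload": {
            "summary": f"{host}/{service} is {state}: {output}",
            "source": host,
            "severity": "critical" if state == "CRITICAL" else "warning",
        },
    }
    request = urllib.request.Request(
        PAGING_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request, timeout=10)


if __name__ == "__main__":
    # Nagios would call this as a notification command with its macros, e.g.:
    #   notify-by-pager "$HOSTNAME$" "$SERVICEDESC$" "$SERVICESTATE$" "$SERVICEOUTPUT$"
    page(*sys.argv[1:5])
```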

To make sure we also hear about downtime of Nagios itself, we set up an external monitoring service that checks Nagios and emails the on-call service as well.
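
The idea behind such an external check is tiny: from a machine outside the monitored infrastructure, fetch the Nagios web interface and mail the on-call address if it does not answer. The sketch below is only an illustration of that idea; the URL and addresses are placeholders, and in our case an external monitoring service does the checking rather than a script like this.

```python
#!/usr/bin/env python3
"""Minimal sketch of an external "is Nagios itself up?" check that
emails the on-call address when the Nagios web interface stops
responding.  The URL and email addresses are placeholders, not our
actual configuration."""

import smtplib
import urllib.request
from email.message import EmailMessage

NAGIOS_URL = "https://nagios.example.org/"      # placeholder
ONCALL_ADDRESS = "oncall@example.org"           # placeholder
FROM_ADDRESS = "external-check@example.org"     # placeholder


def nagios_is_up():
    """Return True if the Nagios web interface answers with HTTP 200."""
    try:
        with urllib.request.urlopen(NAGIOS_URL, timeout=30) as response:
            return response.status == 200
    except OSError:
        return False


def alert_on_call():
    """Email the on-call address so the paging service takes over."""
    message = EmailMessage()
    message["From"] = FROM_ADDRESS
    message["To"] = ONCALL_ADDRESS
    message["Subject"] = "Nagios itself appears to be down"
    message.set_content("The external check could not reach %s" % NAGIOS_URL)
    with smtplib.SMTP("localhost") as smtp:
        smtp.send_message(message)


if __name__ == "__main__":
    # Run from cron on a machine outside the monitored infrastructure.
    if not nagios_is_up():
        alert_on_call()
```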

We also went ahead and created an email address that people can send reports to in case both of those systems fail, or in case of something like a critical issue that Nagios does not detect.

We also set up an email address for reporting any security issues with the infrastructure: emails sent to this address are readable only by Andrea and me.

Conclusion

I hope this blog post gave you a feeling for what the GNOME sysadmin team and I have been doing recently.

I would like to thank all the people who helped me get here, especially Andrea Veri for trusting me and giving me the chance to prove myself.

If you have any questions regarding this post, do not hesitate to email me at .

If you ever need help from the GNOME system administrators, please follow the instructions on .

Thank you for reading this post, and feel free to let me know what you thought of it!
