At CAPSiDE, our main specialty is managing our clients’ systems so they can operate 24/7 with every guarantee of availability, security and performance. From our own experience, we know that many different realities hide behind the concept of “system administration”, depending on the company providing these services. For that reason, we prefer the term “system engineering”, which highlights what sets our service apart.
For us, system engineering (or a system-oriented approach) is the sum of specialists carrying out their job by following a single, shared methodology that ensures a uniform and consistent quality of response for the customer.
We are talking not only about a brilliant technical team but also about homogeneous procedures, information sharing and automation tools that make it easier for each person to respond, minimizing learning and adaptation times.
Engineers, not only administrators
Our engineering work starts when we take responsibility for an existing platform or when we design one from scratch.
In the first case, we carry out a system audit to get the most accurate picture of the middleware installed on the platform and to prepare its adaptation to our processes.
In the second case, the first important task is the architecture design, as I explained in the previous blog article “Architects of the digital Society”. Then, we install the middleware layer with automated procedures so that our pre-configured administration processes (starting and stopping services, log rotation, security configuration, backups, etc.) are in place from the start. In both cases, the process is always carried out in close communication with our client, as significant manual configuration work may be required depending on the type of application and how it is used.
One of the most important processes in the management we carry out is setting up monitoring services. Monitoring gives us both real-time and historical data.
“With no data, you can’t understand the situations encountered or make decisions.”
Paradoxically, many companies don’t even have a service that allows them to detect critical situations before they become an actual problem. Many discover incidents only once they have already happened, with all the negative consequences that follow, such as:
- An extra effort to fix them
- Bad user experience
- A technical team devoted exclusively to “putting out the fire”
- Inability to conduct improvement projects
In system engineering, we focus the monitoring service on incident prevention rather than detection, warning in advance of trends that may jeopardize service stability. We also extend monitoring to different levels of abstraction, giving technicians a technical view of the services and business users a higher-level view with KPIs and SLAs (Business Monitoring).
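To make the idea of warning about trends concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not the tooling we actually run: the function name, the sample format of (hour, value) pairs and the straight-line extrapolation are all hypothetical choices. It fits a least-squares line to recent samples of a metric and estimates how many hours remain before a threshold is crossed.

```python
from typing import Optional, Sequence, Tuple


def hours_until_threshold(
    samples: Sequence[Tuple[float, float]], threshold: float
) -> Optional[float]:
    """Estimate the hours left before a metric reaches `threshold`.

    `samples` holds (hour, value) pairs, oldest first (hypothetical format).
    Returns 0.0 if the latest value is already at or above the threshold,
    and None if there is too little data or the trend is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    last_t, last_v = samples[-1]
    if last_v >= threshold:
        return 0.0
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None  # all samples share one timestamp, no trend to fit
    # Least-squares slope and intercept of value against time.
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    if slope <= 0:
        return None  # not growing, nothing to warn about
    intercept = mean_v - slope * mean_t
    crossing_t = (threshold - intercept) / slope
    return max(crossing_t - last_t, 0.0)


# Example: disk usage (%) sampled once per hour, with a warning level of 90%.
disk_usage = [(0, 70.0), (1, 72.0), (2, 74.0), (3, 76.0)]
remaining = hours_until_threshold(disk_usage, threshold=90.0)
```

A preventive alert would then fire when the estimate falls below some lead time, say 24 hours, well before the threshold itself is reached. A straight line is the simplest possible model; real metrics often call for something more robust.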
Tasks that make up systems engineering
The day-to-day of system engineering is made up of multiple tasks, focused on answering specific needs of our service or of our customers. Tasks originate in different ways, depending on whether they are reactive (mainly incidents), proactive (launched by engineers), routine (scheduled) or requested by the customer.
An incident warning can originate directly from a monitoring alert or from a customer’s report. Incidents are classified according to the systems affected and their criticality.
An incident is considered as such when it involves an interruption or deterioration of the service offered by the platform. We guarantee incident resolution 24 hours a day, 365 days a year. Once we receive an incident, our engineers examine the problem, diagnose the cause and immediately work to restore the service.
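As an illustration of classifying an incident by the systems affected and their criticality, here is a small Python sketch. The three criticality levels, the field names and the rule of taking the highest criticality among the affected systems are assumptions made for the example, not a description of our actual ticketing tool.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, List


class Criticality(IntEnum):
    # Hypothetical scale; a real service catalogue may define more levels.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Incident:
    source: str  # "monitoring" or "customer"
    affected_systems: List[str]
    description: str


def classify(
    incident: Incident, system_criticality: Dict[str, Criticality]
) -> Criticality:
    """Return the highest criticality among the affected systems.

    Systems missing from the catalogue are treated as HIGH, on the
    assumption that an unknown system is safer to over-prioritize.
    An incident with no listed systems falls back to LOW.
    """
    return max(
        (
            system_criticality.get(name, Criticality.HIGH)
            for name in incident.affected_systems
        ),
        default=Criticality.LOW,
    )


# Example catalogue and incident (all names are made up).
catalogue = {"web-01": Criticality.MEDIUM, "db-01": Criticality.HIGH}
incident = Incident("monitoring", ["web-01", "db-01"], "slow responses")
priority = classify(incident, catalogue)
```

Recording the source alongside the classification keeps monitoring alerts and customer reports in the same queue while still making their origin visible.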
These are tasks carried out by the system engineering team in response to monitoring warnings to ensure service security, performance and availability, such as analysis of logs and/or system events, configuration changes, etc.
The important aspect of these tasks is planning and tracking.
As the number of systems grows, it is essential to use planning tools that guarantee tasks are opened automatically and assigned to the right groups.
Routine tasks include applying patches, making backups, performing backup recovery tests, etc. Among all these tasks, we highlight two that set our service apart:
Routine backup recovery testing
Many system administrators merely make regular backups and never test the recovery processes. By default, our services include a quarterly recovery test of all types of data included in the backup (databases, files, etc.).
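For file data, one simple way to check a test recovery is to compare checksums of the original files against the restored copies. The Python sketch below assumes both trees are reachable as local directories and uses hypothetical function names; it is one possible check, not our actual procedure, and databases would need their own verification (for example, restoring a dump and running consistency queries).

```python
import hashlib
from pathlib import Path
from typing import List


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> List[str]:
    """Compare every file under `source_dir` with its restored copy.

    Returns a list of problems ("missing: ..." or "differs: ...");
    an empty list means every source file was restored intact.
    """
    problems: List[str] = []
    for src in sorted(p for p in source_dir.rglob("*") if p.is_file()):
        relative = src.relative_to(source_dir)
        restored = restored_dir / relative
        if not restored.is_file():
            problems.append(f"missing: {relative.as_posix()}")
        elif file_digest(src) != file_digest(restored):
            problems.append(f"differs: {relative.as_posix()}")
    return problems
```

In practice the source files keep changing after the backup is taken, so a real test would compare against checksums recorded at backup time rather than against the live tree.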
Patching service
The goal of this service is to maintain the security level of the middleware managed on the client’s platform. NIST databases are used as a reliable knowledge base for receiving alerts together with their level of risk, so that we can act accordingly.
On receipt of a warning, the affected software versions are checked against each client’s inventory, and the reports used to perform the patching tasks (Patch Management) are generated dynamically, using only the information relevant to each customer.
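The matching step can be sketched in a few lines of Python. Everything here is hypothetical: the `Advisory` structure, the inventory layout and the sample data are invented for the example, exact version strings are compared rather than version ranges, and retrieving alerts from the NIST databases is deliberately left out.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Advisory:
    identifier: str  # made-up identifier, not a real vulnerability ID
    product: str
    affected_versions: FrozenSet[str]
    risk: str


# Hypothetical inventory layout: client -> list of (host, product, version).
Inventory = Dict[str, List[Tuple[str, str, str]]]


def patch_reports(advisory: Advisory, inventory: Inventory) -> Dict[str, List[str]]:
    """Return, per client, the sorted hosts running an affected version.

    Clients with no affected host are left out, so each report contains
    only the information relevant to that customer.
    """
    reports: Dict[str, List[str]] = {}
    for client, installs in inventory.items():
        hosts = [
            host
            for host, product, version in installs
            if product == advisory.product
            and version in advisory.affected_versions
        ]
        if hosts:
            reports[client] = sorted(hosts)
    return reports


# Example with invented data.
advisory = Advisory("EXAMPLE-0001", "exampledb", frozenset({"1.0", "1.1"}), "high")
inventory = {
    "client-a": [("db-02", "exampledb", "1.1"), ("db-01", "exampledb", "1.0")],
    "client-b": [("db-01", "exampledb", "2.0"), ("web-01", "examplehttpd", "1.0")],
}
reports = patch_reports(advisory, inventory)
```

With this data only client-a gets a report, listing db-01 and db-02; client-b runs an unaffected version and receives nothing.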
Service requests are requests made by the customer through the standard support channels to carry out specific tasks on the platform’s service elements. They cover system and application administration and management tasks on the managed platform. These tasks are typically configuration changes, the addition of new functionality, etc.
“All our processes are designed, documented and run following the ITIL management standards.”
Organization of the team of specialists
Another fundamental and differentiating aspect of our service is the organization and operation of the engineering team. We have chosen to stay away from the traditional support model with several tiers of expertise, which does not match our understanding of what customers, or the technical team itself, expect:
- The customer requires its first point of contact to provide direct support instead of merely logging the case.
- Escalating cases between tiers lengthens the response time to the end client.
- The technical level of the staff in each tier doesn’t evolve, as there is no regular rotation between tiers due to limitations in the technicians’ capabilities.
“In our model, small groups of engineers provide support, carry out client projects or work on the continuous improvement of internal tools.”
Engineers rotate between groups at an established frequency, which allows each of them to gain knowledge in all technical areas within the company and to stay in touch with all the clients’ projects. Regular training activities give each engineer a common knowledge base of clients, tools and processes, while a degree of specialization is fostered through external training and certifications (Oracle, Microsoft, Linux, etc.).
The organization’s flexibility allows us to respond easily to specific demand peaks in certain areas by temporarily moving engineers from one group to another without training or adaptation time. To handle growth in the number of customers, the small groups in each area (support, projects, internal improvements, etc.) are multiplied, each specializing in specific clients, and the rotation of engineers is extended across these groups.
The result of this mix of processes, people and tooling is what we call system engineering, a model that has so far proven its validity through the good results achieved with our clients.
About the author
Thierry Davin is a computer engineer and Senior Consultant at CAPSiDE. He started his professional career in France in 1991, the year Linux was born. After several years in software development, he focused on system administration until early 2001, when he entered the world of hosting-related services. Now at CAPSiDE, he focuses on the development of new services.