Operations has a broad business meaning, but I am specifically interested in IT operations (often shortened to ‘ops’ or ‘IT ops’): the way an organisation manages and maintains its IT systems and infrastructure to support and enable its business goals. In a modern digital organisation this takes on a more central, business-critical role that goes beyond simply keeping the lights on. Unlike many of the other entries in the table of IT elements, ops is not explicitly concerned with innovation and change, or with creating new software that improves the organisation. However, at the end of this article you will see how these different perspectives are merging.
In the wider sense, Operations* are the tasks that add value by transforming resources or data into the outputs, goods or services that a Customer desires. If these are the core functions of an organisation (e.g. processing, manufacture, transportation and storage), then support and ancillary functions exist alongside them to provide financial and personnel services, logistics, procurement, management, marketing and sales. All of the above are likely to need technology, information systems, networks or data to operate effectively, or at all.
*There is also a third meaning: in maths, an operation such as addition or multiplication acts on data or variables, which normally involves a transformation … so it all ties in nicely!
The following sections list some of the things that IT ops may do, in no particular order. This is not intended as an exhaustive list. I will finish with a brief mention of a framework for operations management (ITIL) and some modern developments and jargon that you might come across.
Inventory and [software] asset management
As the name suggests, an inventory is a catalogue or library of the digital assets, hardware and software that an organisation has, including licensing information. In terms of ITIL practices, this encompasses ‘… maintaining standard policies and procedures surrounding definition, deployment, configuration, use, and retirement of software assets.’
The deployment of new or amended code is covered below in Maintenance & release management. Configuration is another piece of IT jargon, used here to refer to the tracking and controlling of software versions, i.e. Configuration Management or ‘CM’. Retirement is the removal or decommissioning of old or obsolete code – although this doesn’t always happen very effectively in organisations.
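To make the idea of an inventory a little more concrete, here is a minimal Python sketch of a software asset register. It records a version and licensing information for each asset, flags where installs exceed the licences held, and marks assets as retired rather than deleting their records. The product names, fields and the licence check are all hypothetical; real asset management tools hold far more detail.

```python
from dataclasses import dataclass
from enum import Enum


class Stage(Enum):
    """Hypothetical lifecycle stages for a software asset."""
    DEPLOYED = "deployed"
    IN_USE = "in use"
    RETIRED = "retired"


@dataclass
class SoftwareAsset:
    name: str
    version: str          # the configuration item being tracked
    licences_owned: int   # seats purchased
    installs: int         # seats actually in use
    stage: Stage = Stage.IN_USE


def licence_shortfalls(inventory: list[SoftwareAsset]) -> dict[str, int]:
    """Return, per live asset, how many installs exceed the licences held."""
    return {
        a.name: a.installs - a.licences_owned
        for a in inventory
        if a.stage is not Stage.RETIRED and a.installs > a.licences_owned
    }


def retire(inventory: list[SoftwareAsset], name: str) -> None:
    """Mark an asset as decommissioned but keep its record for audit."""
    for asset in inventory:
        if asset.name == name:
            asset.stage = Stage.RETIRED
            asset.installs = 0


# Hypothetical example inventory.
inventory = [
    SoftwareAsset("office-suite", "16.0", licences_owned=200, installs=215),
    SoftwareAsset("legacy-crm", "4.2", licences_owned=50, installs=60),
    SoftwareAsset("diagram-tool", "2.1", licences_owned=20, installs=12),
]
print(licence_shortfalls(inventory))  # {'office-suite': 15, 'legacy-crm': 10}
retire(inventory, "legacy-crm")
print(licence_shortfalls(inventory))  # {'office-suite': 15}
```

Keeping the retired record, rather than deleting it, is a deliberate choice here. It leaves an audit trail of what was decommissioned and when, which is exactly the step the article notes organisations often handle poorly.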
Security
This is such a big topic that it has its own dedicated IT element. Organisations have legal, regulatory and ethical responsibilities to their staff, customers and partners to protect sensitive data, to guard against fraud and malicious attacks (including denial of service), and to limit the financial and reputational damage that may result. There are many vulnerabilities that need to be protected against and actively managed, in both the physical and logical worlds; a large part of this [cyber] security sits in the IT ops function. A specific activity is Access and Identity Management (‘AIM’, more often called Identity and Access Management or IAM), i.e. granting users the right level of access and permissions to the systems and data they are entitled to, and, of course, preventing unauthorised access.
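As a rough illustration of what ‘the right level of access’ means in practice, here is a minimal role-based access check in Python. The roles, users and permission names are hypothetical, and a real identity and access management system would hold them in a directory service rather than in dictionaries. The one design point worth noting is that it denies by default: unknown users and unlisted permissions get no access.

```python
# Hypothetical role -> permission mapping.
ROLE_PERMISSIONS = {
    "analyst": {"reports:read"},
    "finance": {"reports:read", "ledger:read", "ledger:write"},
    "admin": {"users:manage"},
}

# Hypothetical user -> role assignments.
USER_ROLES = {
    "alice": {"finance"},
    "bob": {"analyst"},
}


def is_authorised(user: str, permission: str) -> bool:
    """Deny by default: only grant a permission one of the user's roles holds."""
    roles = USER_ROLES.get(user, set())
    return any(permission in ROLE_PERMISSIONS.get(role, set()) for role in roles)


print(is_authorised("alice", "ledger:write"))    # True
print(is_authorised("bob", "ledger:write"))      # False
print(is_authorised("mallory", "reports:read"))  # False (unknown user)
```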
Maintenance & release management
Maintenance of software (application code and firmware) and systems includes both planned upgrades, i.e. new versions/releases and patches, and ad hoc defect management. Typically, a problem with an IT system is first raised through an event or incident management function (see Support & Incident Management below). However, deploying software changes into production, i.e. live, overlaps with the development of new code (see the later section on DevOps). As well as testing that the new code or fix works as expected, there is a need to ensure that existing applications and websites still work, and that there is no data corruption or degradation of current services; this is called regression testing. There is also the important matter of when and how changes are deployed: either in real time as hot deployments (also referred to as in-flight) for important or time-critical changes, or during a scheduled or unscheduled system outage. In all cases it is necessary to have facilities to reverse changes (back-out and roll-back) to a previous version of the code.
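The deploy, regression-test and roll-back cycle can be sketched in a few lines of Python. This is a minimal sketch under simplifying assumptions: the `Environment` class is a hypothetical stand-in for production, and the two check functions simulate regression tests rather than calling a real test suite. The point it shows is that every deployment first records a back-out point, so a failed check can restore the previous version automatically.

```python
from typing import Callable


class Environment:
    """Hypothetical stand-in for a live (production) environment."""

    def __init__(self, version: str) -> None:
        self.live_version = version
        self.history: list[str] = []  # previous versions, newest last

    def deploy(self, version: str) -> None:
        self.history.append(self.live_version)  # keep a back-out point
        self.live_version = version

    def roll_back(self) -> None:
        self.live_version = self.history.pop()  # restore the previous version


def release(env: Environment, version: str,
            regression_tests: list[Callable[[str], bool]]) -> bool:
    """Deploy a version, run regression checks, and roll back if any fail."""
    env.deploy(version)
    if all(test(env.live_version) for test in regression_tests):
        return True
    env.roll_back()
    return False


# Hypothetical checks: each takes the live version and reports pass/fail.
def home_page_loads(version: str) -> bool:
    return True


def orders_not_corrupted(version: str) -> bool:
    return version != "1.5.1"  # simulate a release that breaks existing data


prod = Environment("1.4.0")
checks = [home_page_loads, orders_not_corrupted]
print(release(prod, "1.5.0", checks), prod.live_version)  # True 1.5.0
print(release(prod, "1.5.1", checks), prod.live_version)  # False 1.5.0
```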
Performance, Monitoring and Service Levels
When developing new IT systems it is necessary to focus as much on the ‘non-functional’ requirements as on the ‘functional’ ones. The former include expectations about performance under normal and peak levels of activity; availability, including maintenance cycles and expected downtime; and security provisions. All these dimensions, and many more, determine the overall quality, robustness and usefulness of the software product(s). They should form part of the handover documentation at the very least, and possibly of a more formal service contract between the system owner and the system manager or supplier. Such an agreement may need to apply for the life of the system, because the lifetime cost of ownership can be significantly more than the cost of the initial development. This is one of the problems that Agile software development is trying to fix.
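An availability figure in a service agreement is easier to reason about once it is turned into hours. The short worked example below converts a percentage target into a yearly downtime allowance and checks a list of recorded outages against it. The targets and outage figures are illustrative only. It assumes that every outage counts, including planned maintenance; whether maintenance windows are excluded is something the agreement itself has to state.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Downtime per year permitted by an availability target, e.g. 99.9."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def meets_target(availability_pct: float, outages_hours: list[float]) -> bool:
    """Check a year's recorded outages against the target.

    Assumption: every outage counts, including planned maintenance.
    """
    return sum(outages_hours) <= allowed_downtime_hours(availability_pct)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} hours/year")
# 99.0% -> 87.60 hours/year
# 99.9% -> 8.76 hours/year
# 99.99% -> 0.88 hours/year

print(meets_target(99.9, [2.5, 4.0]))       # True  (6.5 h used)
print(meets_target(99.9, [2.5, 4.0, 3.0]))  # False (9.5 h used)
```

Each extra ‘nine’ cuts the allowance by a factor of ten, which is why the cost of meeting a target rises so steeply.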
Service agreements are particularly significant if the system owner is using third-party software, distributed or cloud-based services, sometimes referred to as ‘something-as-a-service’, e.g. Software (SaaS), Platform (PaaS) and Infrastructure (IaaS), collectively known as ‘everything as a service’ (XaaS). The model in which you lease, borrow or consume services on demand is becoming increasingly common in IT and other industries. However, there are risks in not owning the asset outright and not having complete and direct control of your digital estate.
Support & Incident Management
IT operations normally has a role to play in supporting both internal and external system and service users. Depending on the nature of the problems encountered, the maturity of users and the workflow, an IT Service Desk may be the first point of contact (FPOC) for new incidents (sometimes called first-line support), or it may get involved at some later point in the process. The management of issues (another common catch-all term for things that are not performing as expected) and support tickets will be subject to classification and prioritisation, rules for escalation, and Service Level Agreements (SLAs).
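Classification, prioritisation and escalation rules are often expressed as an impact/urgency matrix that maps each ticket to a priority, with an SLA target for each priority. Here is a minimal Python sketch of that idea. The matrix values and resolution times are hypothetical and will differ from one organisation to another.

```python
from datetime import datetime, timedelta

# Hypothetical impact x urgency matrix (1 = high, 3 = low).
PRIORITY = {
    (1, 1): "P1", (1, 2): "P2", (1, 3): "P3",
    (2, 1): "P2", (2, 2): "P3", (2, 3): "P4",
    (3, 1): "P3", (3, 2): "P4", (3, 3): "P5",
}

# Hypothetical SLA resolution targets for each priority.
RESOLVE_WITHIN = {
    "P1": timedelta(hours=4),
    "P2": timedelta(hours=8),
    "P3": timedelta(days=1),
    "P4": timedelta(days=3),
    "P5": timedelta(days=5),
}


def triage(impact: int, urgency: int, raised_at: datetime) -> tuple[str, datetime]:
    """Classify a ticket and work out when the SLA says it must be resolved."""
    priority = PRIORITY[(impact, urgency)]
    return priority, raised_at + RESOLVE_WITHIN[priority]


def needs_escalation(due: datetime, now: datetime) -> bool:
    """Escalate any ticket that has breached its resolution target."""
    return now > due


raised = datetime(2019, 3, 1, 9, 0)
priority, due = triage(impact=1, urgency=2, raised_at=raised)
print(priority, due)  # P2 2019-03-01 17:00:00
print(needs_escalation(due, datetime(2019, 3, 1, 18, 0)))  # True
```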
And last, but not least, none of the above is likely to happen by itself, so some oversight is needed: to set a strategic direction, including investment and procurement; to provide day-to-day planning; to develop and maintain the IT services and infrastructure; to establish and maintain governance procedures; and to set standards, ensure compliance and maintain accreditation, as required (see Standards and frameworks below). Ultimately, operations should ensure Business Continuity and that IT systems play their part in meeting the needs of the parent organisation, its staff, partners and customers.
Standards and frameworks – getting organised
As you can probably tell from the above list, Operations and IT Service Management (ITSM) is a large and complicated subject area, which lends itself to formal practices, governance and standardisation. There is an international standard, ISO/IEC 20000, and a complementary set of best practices called ITIL (Information Technology Infrastructure Library). ITIL was created by the UK government to bring together disparate internal practices and as a guide for private sector contractors. It has since evolved and matured into a de facto standard, with a public/private joint venture managing the framework and granting licences for accreditation centres. Edition 4 was published in February 2019.
New models for Operations Management
There are some newish kids on the block: software engineering approaches to managing technical infrastructure and operations more effectively. Site Reliability Engineering (SRE) and DevOps have come from different places but can be complementary, albeit overlapping, approaches.
DevOps, simply a conflation of Development and Operations, is a spin-off from Agile that attempts to improve communication, multidisciplinary team-working and quality across two previously separate disciplines: developing and deploying new code and applications on the one hand, and maintaining the existing digital estate on the other. Philosophically, the former exists to introduce change, which can cause instability and undermine the function of the latter. A common approach is to make an enhanced development team responsible for the quality and compliance of the operational code. Both areas benefit from advances in Continuous Integration and Continuous Delivery/Deployment (CI/CD), which use automation, collaboration and an underlying Agile mindset to increase the speed and reduce the friction of system change.
SRE started at Google in the early noughties to address a similar issue: preventing operations and development from pulling in different directions. The solution is a new role, the Site Reliability Engineer, who wants to keep production processes stable but also works with developers to ensure code quality and stability.
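The core idea behind a CI/CD pipeline is that every change passes through the same automated gates, in order, and stops at the first failure. The Python sketch below shows only that control flow. The stage functions are hypothetical placeholders that read flags from a dictionary describing the change; a real pipeline would call a build tool, a test runner and a deployment target at each step.

```python
from typing import Callable


# Hypothetical stage functions standing in for real build/test/deploy tools.
def build(change: dict) -> bool:
    return change.get("compiles", False)


def unit_tests(change: dict) -> bool:
    return change.get("tests_pass", False)


def regression_tests(change: dict) -> bool:
    return not change.get("breaks_existing", True)


def deploy(change: dict) -> bool:
    return True  # placeholder: assume the deployment step itself succeeds


PIPELINE: list[tuple[str, Callable[[dict], bool]]] = [
    ("build", build),
    ("unit tests", unit_tests),
    ("regression tests", regression_tests),
    ("deploy", deploy),
]


def run_pipeline(change: dict) -> tuple[bool, list[str]]:
    """Run each stage in order and stop at the first failure (fail fast)."""
    log = []
    for name, stage in PIPELINE:
        ok = stage(change)
        log.append(f"{name}: {'ok' if ok else 'FAILED'}")
        if not ok:
            return False, log
    return True, log


good = {"compiles": True, "tests_pass": True, "breaks_existing": False}
bad = {"compiles": True, "tests_pass": True, "breaks_existing": True}
print(run_pipeline(good))  # deployed: all four stages report ok
print(run_pipeline(bad))   # stops at regression tests, never deploys
```

Because a failing change never reaches the deploy stage, development can move quickly without handing operations an unstable release, which is the tension DevOps sets out to ease.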
I think this quote from Techopedia summarises it well:
The goal of DevOps is to focus on empowering developers so that they can build and manage services … SRE is meant for the monitoring of applications and services after they have been deployed and to implement automation for improving the health and availability of a system.
Both approaches try to break down organisational barriers (sometimes referred to as silos) in an environment that favours incremental changes and automation.
I hope this introduction to Operations and related topics has been useful. Please let the IT chemist know if you have any comments or questions. Thank you.
© 2015-19 IT elementary school Ltd.