Telemetry and troubleshooting primer

1.    Introducing the telemetry blog series

This week we are publishing the first post in a long series about the design and implementation of Cloud Service Fundamentals on Windows Azure. This reusable code package demonstrates how to address some of the most common scenarios we encountered while working on complex Windows Azure customer projects, through a number of components that can become the basis for your own solution development. The first component we are presenting is Telemetry. It is a vast topic in itself, so we decided to break it down into four main buckets, as we described in this introduction.

Our goal for this series of blog entries about telemetry is to guide you through the entire journey, from basic principles up to the end-to-end solution we have built in the Cloud Service Fundamentals (CSF) package. As with many other “build or buy” scenarios, depending on your specific requirements and resources you might want to implement every aspect of a telemetry solution yourself, or evaluate what is already available on the market. The “App Services” section of the Windows Azure Store gives an overview of the most common choices currently available in this space. In either case, going through this series will give you good coverage of the key aspects of monitoring and troubleshooting your cloud-based solution.

2.    Introducing the basics of telemetry and troubleshooting

In this first post, we consider some of the basic principles of monitoring and application health by looking at fundamental metrics, information sources, tools, and scripts. You can use these to troubleshoot a simple solution deployed on Windows Azure (a few compute node instances and a single Windows Azure SQL Database instance). This post elaborates on what you can find in the official documentation.

Operating solutions in a Windows Azure cloud environment combines traditional, well-known troubleshooting techniques with specific toolsets that reduce the added complexity inherent in a highly automated and abstracted platform. When the number of moving parts in a solution is reasonably small, such as a few compute nodes plus a relational database, troubleshooting and diagnostics can easily be performed manually or with minimal automation.

However, for large-scale systems, collecting, correlating, and analyzing performance and health data requires a considerable effort that starts during the early stages of application design and continues for the entire application lifecycle (test, deployment, and operations). This is where the CSF telemetry component can help you reduce the implementation effort.

Providing complete operational insight helps customers meet their SLAs with their users, reduce management costs, and make informed decisions about present and future resource consumption and deployment. This can only be achieved by considering all of the different layers involved:

  • Infrastructure (CPU, I/Os, memory, etc.)
  • The application (database response times, exceptions, etc.)
  • Business activities and KPIs (specific business transactions per hour, etc.).

Processing, correlating, and consuming this information helps both operations teams (maintaining service health, analyzing resource consumption, managing support calls) and development teams (testing, troubleshooting, planning for new releases, etc.).
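To make the idea of correlating data across layers more concrete, here is a minimal Python sketch. It is not part of the CSF package: the `Layer`, `Sample`, and `correlate_by_minute` names, the metric names, and the sample values are all hypothetical and chosen only for illustration. It tags each sample with the layer it comes from and groups samples by the minute in which they were taken, so that infrastructure, application, and business readings can be viewed side by side.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    INFRASTRUCTURE = "infrastructure"  # CPU, I/O, memory
    APPLICATION = "application"        # database response times, exceptions
    BUSINESS = "business"              # business transactions, KPIs


@dataclass(frozen=True)
class Sample:
    layer: Layer
    metric: str    # hypothetical metric name
    minute: int    # minutes since the start of the observation window
    value: float


def correlate_by_minute(samples):
    """Group samples from all layers by the minute in which they were taken."""
    by_minute = defaultdict(dict)
    for s in samples:
        by_minute[s.minute][(s.layer, s.metric)] = s.value
    return dict(by_minute)


# Illustrative data only: minute 1 shows high CPU, slow database calls,
# and a drop in orders at the same time.
samples = [
    Sample(Layer.INFRASTRUCTURE, "cpu_percent", 0, 35.0),
    Sample(Layer.APPLICATION, "db_response_ms", 0, 40.0),
    Sample(Layer.BUSINESS, "orders", 0, 120.0),
    Sample(Layer.INFRASTRUCTURE, "cpu_percent", 1, 95.0),
    Sample(Layer.APPLICATION, "db_response_ms", 1, 900.0),
    Sample(Layer.BUSINESS, "orders", 1, 15.0),
]

timeline = correlate_by_minute(samples)
```

Looking up `timeline[1]` returns the readings from all three layers for the same minute, which is the kind of side-by-side view that lets an operations team connect a drop in a business KPI to an infrastructure or application symptom.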

For large-scale systems, telemetry should be designed to scale. It must execute data acquisition and transformation activities across multiple role instances, storing data in multiple raw data SQL Azure repositories. To facilitate reporting and analytics, data can be aggregated in a centralized database that serves as the main data source for both pre-defined and custom reports and dashboards. These aspects will be considered in the next posts of this series.
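The flow from several raw repositories to one centralized database can be sketched as follows. This is only an illustration under stated assumptions, not the CSF implementation: in-memory SQLite databases stand in for the SQL Azure repositories, and the table names (`raw_samples`, `metric_summary`), function names, and sample rows are hypothetical.

```python
import sqlite3


def make_raw_store(rows):
    """Create one raw-data repository (in-memory SQLite as a stand-in)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE raw_samples (instance TEXT, metric TEXT, value REAL)")
    conn.executemany("INSERT INTO raw_samples VALUES (?, ?, ?)", rows)
    return conn


def aggregate(raw_stores, central):
    """Pull per-metric partial sums from each raw store and merge them centrally."""
    central.execute(
        "CREATE TABLE IF NOT EXISTS metric_summary "
        "(metric TEXT PRIMARY KEY, sample_count INTEGER, total REAL)"
    )
    for store in raw_stores:
        partials = store.execute(
            "SELECT metric, COUNT(*), SUM(value) FROM raw_samples GROUP BY metric"
        ).fetchall()
        for metric, count, total in partials:
            # Add to an existing summary row, or create it if this metric is new.
            updated = central.execute(
                "UPDATE metric_summary SET sample_count = sample_count + ?, "
                "total = total + ? WHERE metric = ?",
                (count, total, metric),
            )
            if updated.rowcount == 0:
                central.execute(
                    "INSERT INTO metric_summary VALUES (?, ?, ?)",
                    (metric, count, total),
                )
    central.commit()


def averages(central):
    """Return the overall average per metric from the central summary table."""
    rows = central.execute(
        "SELECT metric, total / sample_count FROM metric_summary ORDER BY metric"
    )
    return dict(rows.fetchall())


# Two hypothetical raw repositories, each fed by a different role instance.
store_a = make_raw_store([("web_0", "cpu_percent", 40.0), ("web_0", "cpu_percent", 60.0)])
store_b = make_raw_store([("web_1", "cpu_percent", 80.0), ("web_1", "db_response_ms", 120.0)])

central = sqlite3.connect(":memory:")
aggregate([store_a, store_b], central)
summary = averages(central)
```

Each raw store computes its own partial counts and sums, so only a few summary rows travel to the central database rather than every raw sample. Reports and dashboards would then read from the central summary instead of querying each raw repository.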
