I recently hit an issue with an Azure Function Event Hub trigger running on a Linux Consumption Plan: events stopped being consumed by the trigger once the Function App went to sleep. I learned a lot about Consumption Plan scaling and how to diagnose what was going on, so in this post I’m going to share the key takeaways, as it was a bit of a journey.
Background: Azure Function Event Hub trigger stopped processing events
This was in a low-throughput scenario for testing - so the rate of events coming into the Event Hub was low/sporadic. What I was seeing was:
- my Event Hub trigger (C#) would stop consuming events - no errors, no failures, events were still hitting the Event Hub but just weren’t being consumed
- I went into the Function App in the Azure Portal to investigate, and it would then kick back into life and process the events despite not changing anything
- inevitably, it would then stop consuming events again after a short period of time once I came out of the Portal
Based on some Googling, I saw others raising similar issues going back to 2017. But either the raised issues went cold and were closed with no real outcome, or they were caused by problems that have since been resolved and didn’t apply in my case.
Background: Scale controller
A good intro can be found in the MS docs here. The key points are:
- the scale controller monitors events from whichever source is triggering your Azure Function (Event Hub in my case) to determine when instances need to scale out or in
- this will scale instances down to 0 when no events are coming in, and your app will then go to sleep
- this should then scale instances back up when it detects events are coming in and the number of current instances is not sufficient to process them
- in the scenario where it needs to scale back up from 0 instances, you will experience a cold start - a slight delay while it spins up an instance
In my scenario, no matter how long I waited, it was not scaling up from 0 despite new events coming in. It only processed them when I went into the Azure Function in the Portal - this is explained by the fact that when you go in there, it actually forces it to be loaded onto an instance. This is bypassing the scale controller. Once you know this behaviour, it helps focus you on to “why isn’t the scale controller doing its thing?”
Diagnosing scale controller
The key to identifying the underlying issue was to enable scale controller logging by adding an appsetting for SCALE_CONTROLLER_LOGGING_ENABLED. I wanted the logs to go to Application Insights, so I set it to AppInsights:Verbose. Alternatively, you can log to a blob if you prefer, as per the docs, but the following relates to diving into AppInsights. There is info on how to query those logs here - I used this query to see everything going on, which was key:
traces
| extend CustomDimensions = todynamic(tostring(customDimensions))
| where CustomDimensions.Category == "ScaleControllerLogs"
| order by timestamp desc
I could then see this error:
FunctionName: 'TestEventHubTrigger'. Trigger's Event Hub name '%EventHubSettings:HubName%'
failed to resolve from AppSettings. Error: ''%EventHubSettings:HubName%' does not resolve
to a value.'.
Here’s my trigger method signature:
[FunctionName("TestEventHubTrigger")]
public async Task Run(
    [EventHubTrigger("%EventHubSettings:HubName%",
        Connection = "EventHubSettings:Connection",
        ConsumerGroup = "%EventHubSettings:ConsumerGroup%")]
    EventData[] events,
    ILogger log)
My Function does work and processes events when I nudge it in the Portal, so those appsettings are configured correctly - yet the scale controller is having an issue resolving them.
As I’m targeting Linux, I had the appsettings configured in the portal using the __ form - i.e. EventHubSettings__HubName.
So I tried out changing that setting from %EventHubSettings:HubName% to %EventHubSettingsHubName% so it’s not nested but root-level instead and updated the appsetting to EventHubSettingsHubName. Restarted the Function App - bingo! Well, nearly - it didn’t like the ConsumerGroup setting either, but showed I was on the right path.
Once I rolled this out, I then spotted a consistently higher level of requests on the Event Hub, even when events were not coming in. This looks like a good baseline indicator of a healthy scale controller, as I believe it shows the scale controller monitoring the Event Hub for incoming messages. Here’s a screenshot of the Requests metric from my Event Hub namespace dashboard - you can clearly see the increase in the base level of Requests as soon as I rolled out the fix:
Resolution
Changing those appsettings to avoid the multiple levels was the fix. Once I updated them all, including the Connection one (which does not require the %), I started seeing the Event Hub trigger behaving properly. Via the scale controller logging, I could see no more errors. I waited for it to log that it had changed the instance count down to 0, gave it a bit of extra time, and then fired an event in - it correctly scaled out. The cold start resulted in about a 10-15 second delay before it picked up the event.
This feels like a workaround and perhaps an issue within the scale controller - given the trigger works and pulls in those appsetting substitutions fine.
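For reference, here is a minimal sketch of what the flattened trigger binding might look like. Only EventHubSettingsHubName is the name I actually mention above; EventHubSettingsConnection and EventHubSettingsConsumerGroup are illustrative names that follow the same pattern, and the class name and method body are placeholders. The using for EventData assumes the 5.x Event Hubs extension, so adjust it if you are on an older version.

```csharp
using System.Threading.Tasks;
using Azure.Messaging.EventHubs; // assumption: 5.x extension; on 4.x, EventData lives in Microsoft.Azure.EventHubs
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public class TestEventHubTriggerFunction // hypothetical class name
{
    [FunctionName("TestEventHubTrigger")]
    public Task Run(
        // Root-level appsettings only: no ':' (or '__') nesting for the scale controller to trip over.
        [EventHubTrigger("%EventHubSettingsHubName%",
            Connection = "EventHubSettingsConnection",               // illustrative name; no % around Connection
            ConsumerGroup = "%EventHubSettingsConsumerGroup%")]      // illustrative name
        EventData[] events,
        ILogger log)
    {
        // Placeholder body: just record how many events arrived in this batch.
        log.LogInformation("Received {Count} events", events.Length);
        return Task.CompletedTask;
    }
}
```

Each name used in the attribute then needs a matching root-level appsetting on the Function App, replacing the nested EventHubSettings__... ones.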
Key Takeaways
- Use SCALE_CONTROLLER_LOGGING_ENABLED appsetting temporarily to help troubleshoot what is going on
- Avoid any issues with appsetting substitution by keeping them as simple top-level settings, not nested levels
- The Requests metrics graph in the Event Hub dashboard is a good indicator of base health