AlwaysOn Secondary down -> Whole Cluster goes down

  • Hello everyone,

    We had a big issue today during maintenance work in our SQL environment. I hope you can help us figure out what we are doing wrong.

    So our environment:

    - 2x SQL Server 2014 Enterprise on Windows Server 2012 R2 (SRV1 and SRV2)

    -- Both Hyper-V VMs on different Hosts

    -- Both configured as nodes of a Windows Failover Cluster and as replicas in an AlwaysOn Availability Group (AG1)

    -- AG Listener: AG1_lis

    -- No shared storage (each Hyper-V Host has its own local storage)

    -- Asynchronous Mode

    -- SRV1 is primary, SRV2 is secondary SQL node

    What happened?

    - Shut down Windows on SRV2 for hardware maintenance

    - Cluster goes offline, AG1 goes offline

    -- Error message: "Stopped listening on virtual network name 'AG1_lis'."

    -- Error message: "The availability group database "DatabaseXY" is changing roles from "PRIMARY" to "RESOLVING" because the mirroring session or availability group failed over due to role synchronization."

    Results?

    - AG1_lis wasn't available to our applications, and they stopped working properly because the database connection was lost!

    I think, and I HOPE, this is not the normal behaviour when one node is shut down (especially the secondary node!).

    I already searched a little bit and found two things which could be the problem but I am not sure:

    1. We haven't set up any quorum. I read a lot of documentation about AlwaysOn and concluded that a quorum is not necessary in our environment. Am I totally wrong on that point? Do we need a quorum in a 2-node cluster without shared storage?

    EDIT: This might be the problem. I will set up a Node and File Share Majority Quorum to solve this problem...

    2. I found this topic in this forum: http://qa.sqlservercentral.com/Forums/Topic1465938-2799-1.aspx

    I see this in our environment: the "Current Host Server" in Windows Failover Cluster Manager is SRV2, while in SQL Server the primary node of AG1 is SRV1. Is this the problem? Why do these differ? Is it normal?

    I hope you can help me with these two questions and my problem...

    Best regards,

    babo

  • You're losing quorum, which takes the cluster service offline. When that happens, any cluster roles also go offline.

    For a 2-node cluster you should employ a witness. Which witness have you used?

    Please post the output of the following PowerShell command, run on one of the cluster nodes:

    Get-ClusterQuorum

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Correct, a quorum witness is missing from the setup above.

    However, I have the same setup at my workplace. My scenarios are:

    1. If Node2 (secondary) is gracefully rebooted (because of Windows patching), all cluster resources go down and, once the reboot completes, come back ONLINE automatically. No issues found. This is fine.
    2. When Node1 (primary) is rebooted, the cluster and its resources go down and never come back ONLINE after the reboot completes. That means we have to bring those cluster resources up manually. This is the worst scenario.
    3. We DISABLED the network on Node1 (primary). Immediately all our RDC connections were disconnected, the cluster went down, and the AG listener went down on Node1. We then re-enabled the network on Node1. The cluster never came back by itself, and the cluster resources stayed OFFLINE even though Node1 was up and running. When we then rebooted the passive Node2, the cluster resources, the AG listener, and everything else came back immediately.

  • In a 2-member cluster, when one node can't see the other, the cluster decides to shut itself down to prevent a split-brain situation, in which you could have two copies of the databases online with an unpredictable number of clients working against both (changing data, etc.).

    So a third voting member (a witness) is needed to form a quorum with either the first node or the second.

    The node that is in quorum keeps its databases online; the other one (without quorum) goes down immediately.
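
The majority rule discussed in this thread can be sketched in a few lines of Python. This is a simplified model, assuming a static configuration in which every member has exactly one vote; it ignores features such as dynamic quorum and is not how the cluster service itself is implemented. The `Voter` class, the `has_quorum` function, and the name `FileShareWitness` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Voter:
    """One voting member of the cluster: a node or a witness (hypothetical model)."""
    name: str
    online: bool


def has_quorum(voters):
    """Return True only if a strict majority of all configured votes is online."""
    total = len(voters)
    online = sum(1 for v in voters if v.online)
    return online > total // 2


# Two nodes, no witness: SRV2 is shut down for maintenance.
two_nodes = [Voter("SRV1", True), Voter("SRV2", False)]

# Same situation, but with a third vote from a witness that is still reachable.
with_witness = two_nodes + [Voter("FileShareWitness", True)]

print(has_quorum(two_nodes))     # 1 of 2 votes is not a majority
print(has_quorum(with_witness))  # 2 of 3 votes is a majority
```

With only two votes, losing either node leaves one of two votes, which is not a strict majority, so the surviving node cannot tell a failed partner from a network partition. Adding a third vote lets the surviving node and the witness together form a majority.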

  • What is the windows OS version and edition you are using for these cluster nodes?


  • Windows Server 2016 Standard and SQL Server 2017 EE

    regards

    Sree
