MattScott

CEO
  • Content Count

    1300
Posts posted by MattScott


  1. Hi there,

     

    We recently had an issue with our primary database. The situation is fixed, and no data was lost. But we are still restoring the old Hoplon data that is used by the migration system. You should be able to migrate your account by Friday.

     

    Thanks,

    Matt


  2. Hi all,

     

    I'm not interested in sugarcoating or spinning this last week.

    I am sorry for all the downtime.

     

    Many players have asked what happened, so I'm going to do my best to summarize the events.

     

    On Monday the 29th, we took the servers down to move data centers and migrate to all new hardware for our backend systems. This move was necessary for a couple reasons, but mainly because a lot of the hardware was more than 5 years old, and we were already seeing failures and system performance problems. It was only a matter of time before something critical failed. The timing also happened to line up with the end of our last legacy (ludicrously overpriced) hosting contract, and we needed to make some network architecture changes to facilitate some of the features coming after the Engine Upgrade. 

     

    I figured we could kill three birds with one stone.

     

    The core challenge was managing all the information needed to drive APB since all the hardware was new. That meant backing everything up and then hand carrying a series of large hard drives from one location to another. The hardware had been prepped and configured in advance, so while we knew this would be challenging, we had a fairly detailed plan and felt we could manage any issues that popped up.

     

    Problem #1: Unfortunately, during the move we unearthed some buried issues that delayed our ability to bring servers back online for quite a while. The team did a solid job of working through those problems and even recompiling code in a couple places to remove landmines that we stumbled over. But once we got the servers back online, I felt we did a decent job.

     

    Problem #2: Shortly after we went live, the brand new RAID controller in our new primary database server quickly degraded in performance. To make the situation worse, we rushed to get servers back online and decided it would be okay to let players back in while the secondary database server finished syncing. The hardware failure hit so quickly that the secondary wasn't ready, so we couldn't fail over. At this point we made an effort to keep the servers online through the weekend, and while our jury-rigged fix allowed some players to get on, it also led to many other players being unable to log in (Error 9). The team decided the quickest way to fix the issue would be to build an entirely new primary database server and then swap everything over on Monday. We didn't want to risk moving damaged drives to the new servers, so we needed a complete backup to make sure we didn't lose anything.

     

    Problem #3: Once we shut off all the servers and started the backup, we found that the faulty RAID controller could only copy files at the rate of 1GB per minute. After 18+ hours, we were finally able to complete the backup, and then finally get the new server finalized and back online.
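
    As a rough illustration of the arithmetic behind those two numbers (not figures from the team): if the 1GB per minute rate held constant, 18 hours of copying would correspond to about 1,080GB. A minimal Python sketch, where the function names and the constant-rate assumption are hypothetical:

```python
def data_copied_gb(hours: float, rate_gb_per_min: float = 1.0) -> float:
    """Gigabytes copied in `hours` at a constant rate (an assumption, not a measurement)."""
    return hours * 60 * rate_gb_per_min


def transfer_hours(size_gb: float, rate_gb_per_min: float = 1.0) -> float:
    """Hours needed to copy `size_gb` at a constant rate."""
    if rate_gb_per_min <= 0:
        raise ValueError("rate must be positive")
    return size_gb / rate_gb_per_min / 60


print(data_copied_gb(18))    # 1080.0
print(transfer_hours(1080))  # 18.0
```

    The same calculation run in reverse (remaining data divided by the observed rate) is how a remaining-downtime estimate can be made, although a failing controller is unlikely to hold a steady rate.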

     

    There are a lot of things that went wrong, but in the end, I should have planned better. With that much new hardware, we were bound to have an issue somewhere.

     

    To make it up to everyone, on Friday we will be turning on 2 weeks of free Premium for all players. For anyone who has existing Premium, this will add to it.

     

    I never want to have to make this kind of apology again.

    Little Orbit can and will do better in the future.

     

    EDIT: We just started awarding the Premium (4pm Friday Pacific time). There are a lot of players to gift, so we're doing this in waves. To start we will hit everyone who has logged in within the last 30 days sorted by most recent login first. Then we'll go back 30 days, etc.

     

    Sorry,

    Matt

    • Like 65
    • Thanks 36

  3. Hi everyone,

     

    Unfortunately, we are still working on backing things up properly. The file transfer from the bad server is very slow. We’ve done the math on the remaining files at their current rate and added repair time, and we feel the outage is going to run another 6 hours to roughly 6am Pacific time.

     

    We are working through it as fast as we can.

     

    Sorry,

    Matt

    • Like 9
    • Thanks 13

  4. Hi everyone,

     

    We're running a little long on the hardware maintenance.

    We're currently estimating another 6 hours with servers coming online around midnight Pacific time.

    We apologize for the delay, but we also feel it is critical to get this piece of equipment working properly for everyone.

     

    I'll post as soon as servers are back online.

     

    Sorry,

    Matt

    • Like 7
    • Thanks 10

  5. Hi there,

     

    Lixil has updated her post on this issue. 

     

     

    We are hoping to fix the issue tomorrow (5/6).

     

    Thanks,

    Matt


  6. Hi all,

     

    We are working on the issue. Lixil has updated her original post with more information.

     

     

    We moved many servers last week. Unfortunately, one of them has some bad hardware and is malfunctioning. Logging in is hit or miss right now.

     

    We are hoping to fix the hardware tomorrow (5/6).

     

    Thanks,

    Matt

    • Like 1

  7. Hi there, 

     

    Lixil just updated her post in Social.

     

    Essentially, we have a bad piece of equipment in the new data center that is causing a problem. We have already secured a replacement part and we hope to schedule downtime tomorrow (5/6) to fix the issue.

     

    Sorry,

    Matt

    • Thanks 1

  8. Hi all,

     

    On Monday, April 29th, we scheduled downtime to physically move from one data center to a new one. This move was unavoidable based on a very expensive, legacy Reloaded contract that had ended. Even though we asked for an extension, we were forced to move out before May 1st 2019.

     

    As you know, Fallen Earth is an extremely old game. The servers run on an OS that is no longer available to download, it takes more than an hour to even reboot, and it requires many different databases that all have to be synced to operate correctly. We made backups and planned to move each system one by one into the new data center. However, despite many precautions, some data was lost. The engineers have spent their waking hours attempting to find the right mix of files to get everything restored properly.

     

    We did eventually get the system back online, but it appears roughly 2 weeks of progress was lost.

     

    Having exhausted all other options, we are going to be putting the servers back online and moving forward.

     

    In the meantime, we'll be doing the following to help players recover:

    - We'll be giving out Commander to all players for 2 weeks to help them get caught back up

    - Anyone who lost purchases due to the rollback can open a trouble ticket at http://support.gamersfirst.com, and we'll escalate getting those taken care of as quickly as we can

     

    We're not going to start the Commander for another couple days, so that all the players can get up to speed on what has happened. I want everyone to be able to take full advantage of the boost over the coming weeks.

     

    Please know that the team worked very hard to get us to this point, and we are committed to getting the back end re-written so it can be properly supported in the future.

     

    EDIT: We will be turning on 2 weeks of Commander for the Fallen Earth players.

    EDIT: We waited an extra couple of days to try and make sure server performance was better.

     

    Effective 5/16, we have activated a 4 week Commander code as compensation to the players in the hopes of helping them catch up on lost progress.

    The code is: FallenNot4gotten

     

    Apologies,
    Matt

    • Like 4
    • Thanks 8

  9. Hi all,

     

    I think we've reached the limits of what we can do, and the downtime has already been excessive.

    We did some tests and the newer data is too incomplete to work. 

     

    With that in mind, the servers are going to be put back online shortly, and I'll make a public post about the missing data for the rest of the players.

     

    Moving forward:

    - We'll be giving out Commander to all players for 2 weeks to help them get caught back up

    - Anyone who lost purchases due to the rollback can open a trouble ticket at http://support.gamersfirst.com, and we'll escalate getting those taken care of as quickly as we can

     

    The team is committed to getting the back end re-written so it can be properly supported.

     

    Apologies,
    Matt

    • Thanks 4

  10. Hi all,

     

    We are going to try getting our secondary environment up and running.

    Then we can test a new set of data. This won't be a super fast process, and many of my team are recuperating this weekend.

    Bear with us.

     

    My advice for current players would be to not go crazy. If we can confirm a solid way to restore the 2 weeks of missing progress we will.

    But that will mean losing anything since the servers came back online.


    Thanks,
    Matt

    • Like 3

  11. Hi all,

     

    I am looking at the data issue with my team, and there is no easy answer.

    Right now we have two choices.

     

    1) Leave it the way it is and work on fixing the accounts that lost paid items. We have records on the payment side, so it should be easy to restore those. The team is exhausted, but over time we can also try restoring some of the new data in a separate area which would allow us to verify some of the bigger lost progression items / in-game items in order to grant those back to players.

    or

    2) Take the servers down and try another round of restoring data from a different set of possibly newer backups. We already know one of the databases from this newer set of backups is significantly out of sync and older. So there is risk that we will introduce a whole bunch of problems with incompatible data.

    Personally, as painful as it is, I'm going to recommend that we plow forward and escalate the support tickets for the players who lost real money purchases.
    Apologies for the issues.

     

    Sorry,
    Matt

    • Like 1

  12. 30 minutes ago, zefcool said:

    hi, were there no plans of hardware upgrade that would speed up the boot and troubleshooting to begin with ?

    seems like move the relics first and upgrade them to 2019 standards for FE 2.0 project was a bad idea, relics died in the middle...

    maybe its a clue to upgrade the hardware now rather than later ?

    We can't upgrade the hardware. And the code for the servers runs on an OS that isn't available any more.

    For the move, we made images of everything. Backed up everything. And then moved the hardware 1 to 1, so as not to disrupt anything.

    Even with all of our precautions, we still have problems.

     

    My approach is to upgrade the code first and then we can update the hardware properly. That effort has been underway for quite a while now.

    • Like 2

  13. Hi all,

     

    I can’t know how frustrating it is for everyone right now. I wish I had more details to give out. Fallen Earth is a collection of many databases and many different servers that all have to mesh together properly for us to unlock the doors and let players in. There were problems in the move and in the restore process that we are still working out.

     

    To give you an idea of how old and large the game is, it takes over an hour to fully reboot everything and come back online. That means we make some changes, but then wait significant amounts of time just to see the results. Then we make more changes and then wait some more. It’s been extremely frustrating.

     

    The team is going to continue working on the servers this weekend till they are online properly.

     

    Stay tuned.

     

    Thanks,

    Matt

    • Like 1
    • Thanks 7