HNS IT: PowerSchool Adventure Time

Wednesday morning, I prepared to run the annual process that promotes students, gives them course schedules, archives grades and attendance, clears fees, etc. It is the biggest, most beastly task we do with the database, and I keep copious notes about every step of the process. These notes change a bit every year, because usually we're forced to update to a new version of PowerSchool sometime during the intervening year, and something always changes. This year, the notes changed a lot.

It began typically. I'd done the validations, fixed the anomalies, exported the historical data, etc. I saved a datapump (full copy of the Oracle db), and I took a VMware snapshot of the whole server. (I haven't done snapshots in a while, and I forgot to shut down the guest first. That'll be significant later.) Two backups is better than one, right?

My notes from the previous two or three years indicate that step 16 of 49 in the end-of-year process is to turn off atomic sync mode. This is some kind of magical incantation that used to be found in the "special operations" page, in a dropdown containing a schizophrenic mix of mundane and horrific commands.

But, in PS version 8.3, "Turn off atomic sync mode" is gone from the list. So, I thought, "well, we must not need that anymore."

Wrong.

So anyway, step 18 -- "commit the high school's schedule" (copy from the staging area to the production area, basically) -- got to the part where it copies the Schedule Section records to the live side, and proceeded to vomit an unending series of:

ATTNENTION! Critical Error generated from MWCP_ParseResponse Description=Milan sync error: DALX-Milan Atomic Sync Failure: DALX-Milan Sync Failure Detected, Response Command=TRANSACTION MsgNum=991
ATTNENTION! Critical Error generated from MWCP_ParseResponse Description=Milan sync error: DALX-Milan Atomic Sync Failure: DALX-Milan Sync Failure Detected, Response Command=TRANSACTION MsgNum=992
ATTNENTION! Critical Error generated from MWCP_ParseResponse Description=Milan sync error: DALX-Milan Atomic Sync Failure: DALX-Milan Sync Failure Detected, Response Command=TRANSACTION MsgNum=993

...and so on. With a sinking heart, I began to read the support pages. Herein follows an excerpt, verbatim:

If the schedule has not committed successfully please attempt the following workaround(s).

A) Commit the Master Schedule a second time to resolve the issue.

If workaround A is unsuccessful move on to workaround B

B) Verify that duplicate ScheduleSections (duplicate Course/Section Numbers) and/or duplicate ScheduleCC records are not present. The following PlugIn will assist in finding these duplicates. [link to a user-supplied add-on]

If workaround B is unsuccessful move on to workaround C

C) Using DDA, delete the Section_Meeting records for the School/Year you are attempting to commit. Then recommit the schedule. Caution: Manual modifications to database records should only be attempted by authorized technical contacts, and can result in the loss of information if not executed properly. Before making any changes via Direct Database Access, you should perform a manual backup of your PowerSchool database. For instructions on how to perform a manual backup, click here.

These errors occur when the commit process attempts to copy invalid data to the live side, such as the items in workaround B. This prevents bad data from be added to the live side. The ability to disable the Atomic Sync was removed in PowerSchool 8.1.1 (Per Release Notes KB 73319). Development made several fixes to committing the schedule in 8.2 as well. Users who are still on version 8.1.1 will need to upgrade.

...wow. I'm on 8.3 by the way.

Then I began to read the customer comments on that KB article, and it got even better.

I just pushed the above plug in and it works great. Why not include this in PS by default rather than make us hunt for it after the fact?

in section view and teacher schedule view, some sections expressions did not display. I checked the sections to see if the day and period boxes were still checked and they were, so I resubmitted that page, but it still did not make them appear. I then used dda to see if the expressions were actually there, and there were. I then had another user log in to show them the problem, and the expression magically appeared. I went back and logged out and back in again, and the expressions then appeared for me. On my next school that I committed (and by the way we are doing it twice at each school as work around for initial problem) I had the same problem and same situation where another user could see the expressions and I could not even after I logged out and back in. I even tried logging in with another browser and they still were not displaying for me, but were for 2 other users. I waited about 30 more minutes and checked and again, they magically appeared. I have no explanation for it on one of my schools I saw all the expressions immediately myself, but all other schools so far, the expression is not displaying initially on some sections. Later it seems to fix itself.

None of the three workarounds worked for me. Every time I recommitted the schedule, the CC records from the previous attempt to commit were not deleted. This resulted in duplicate CC records. Without the ability to disable atomic sync, I could not delete the records via DDA. The duplicate CC records had to be deleted by Tier 2 tech support at Pearson. Everything is working now, but it took 3 days from the first commit to the eventual resolution of the issue.

After reading all of the above comments. I found the following worked for me.
1. Commit with both options (students and sections).
2. Received errors just mentioned in this KB Article.
3. Ran the commit a 2nd time.
4. Warning that calendar entries would need to be overwritten, but no errors on 2nd schedule commit.
5. Expressions showed up sometimes, other times not so I did a work around found in the knowledge base comments. Go to school setup > days. Submit with no changes. Expressions start showing up for me if they wasn't before.
6. Enrollment counts do not show up until Calendar Setup is completed then the following special operations.
7. Reset Section Meetings per KB 10247. Rebuild schedules and rosters
8. Realign section/schedule enrollments. Reset Class Counts
9. School > Regenerate Bitmaps
Finally everything appears normal.

At this point, my despair turned to panic. The errors were still scrolling from my abortive attempted commit two hours earlier. The rate was approximately 1 error every 2.5 seconds, and it was into the MsgNum=2000 range. I wouldn't be able to try any of the workarounds until it finished. How much longer could it take? I opened a tab into our test server and queried for this year's scheduleCC records. (A CC record represents a link between a student and their enrollment in a section of a course.) We have about 775 students who take a mix of mostly semester-long courses and a few year-long courses, with a courseload of about 7 classes per semester, so the average is something like 12 unique course sections per student per year from the database's perspective.

775 students * 12 classes: 9300 records

9300 records * 2.5 seconds to try to copy, fail, and display an error: 23,250 seconds = about 6.5 hours

That's how long I would have to wait before I could attempt the first of three workarounds. Which might lead to the same result.
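
For anyone who wants to check my napkin math, here is the same estimate as a quick Python sketch. The function and variable names are made up for illustration, and it assumes every CC record fails at the same eyeballed pace of one error per 2.5 seconds, which is a rough guess rather than a measurement.

```python
# Back-of-the-envelope estimate of how long the failing commit would keep
# spewing errors. All inputs are the rough figures quoted in the post.
STUDENTS = 775                # approximate high school enrollment
SECTIONS_PER_STUDENT = 12     # rough average of unique course sections per year
SECONDS_PER_ERROR = 2.5       # eyeballed rate of one error every 2.5 seconds


def estimate_failure_run(students, sections_per_student, seconds_per_error):
    """Return (records, total_seconds, total_hours) for the failing copy."""
    records = students * sections_per_student
    total_seconds = records * seconds_per_error
    return records, total_seconds, total_seconds / 3600


records, seconds, hours = estimate_failure_run(
    STUDENTS, SECTIONS_PER_STUDENT, SECONDS_PER_ERROR
)
print(f"{records} records, {seconds:,.0f} seconds, about {hours:.1f} hours")
```

Change any of the three inputs and the estimate scales linearly, so a bigger school or a slower error rate makes the wait proportionally worse.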

It was now about noon. Despair, which had become panic, now became rage. I thought of someone who has been a PowerSchool guru to us for several years, whose name I will not put here, since he swore me to secrecy. I expected nothing more than sympathy, but we had the following exchange:

Me: This is completely fucking insane. Do you have any other customers who've run into this? Do you have any suggestions?

Him: Did I give you the secret instructions for turning off syncing before the commit?

Me: The special op to turn off atomic sync is gone. Do you know a backdoor?

Him: Yep! In PowerSchool, Go to System, and in the address location after admin/tech/ type in executecommand.html
You should get a big entry box, in it type:
**DALX_SetGlobalSyncOff
Then Execute
Try the Commit again. Once completed successfully and you’ve run EOY, go back to executecommand.html and:
**DALX_SetGlobalSyncOn
Then Execute
You didn't hear it from me!!!!!! Let me know how it worked!

Me: OMFG...OK...so now my question is, do I wait for it to finish failing, or do I revert the VMWare snapshot I took right before the shit hit the fan? ...I'm reverting the snapshot. This is going to take many hours otherwise.

(You can't even GET to the executecommand page by clicking links in the web app. You have to type it in. Just for an added bonus.)

OK so, you might recall the part earlier where I mentioned a VMware snapshot, and how I forgot to shut down the guest first. Well, it turns out, that's a bad idea. The server was not cool after I reverted and powered it on. A nagging mental itch began to form in the recesses of what was left of my mind, and I wrote to my close friend and VMware master Jordan:

Me: I took a snapshot with the options to NOT save the virtual memory, and to YES quiesce the filesystem. Turns out I needed to revert to that snapshot, which I did and which succeeded/completed. The guest acts like I pulled its power -- had to power it back on, got the Windows "why the unexpected shutdown" message, Oracle's pissed, etc. What did I do wrong?

Him: It is exactly like you pulled the power plug if you snap when the machine is running. All the quiescing, etc. reduces the chance of corruption. I usually turn the machine off, snap, and then fire it back up. That way if I go back to the snap, it doesn't get unhappy.

oh, fuck.

Me: Ah. I forgot to do that. Oh well, I took a datapump beforehand and I'm on with support now. Fuck ME. Thanks.

Him: It's bad luck, mostly. I mean, I've hard killed or power cycled computers hundreds of times, and they almost always spring right back up. My nightly backups are all like you did it today: crash consistent. Again, how often does something get corrupted if you power cycle a computer? nearly never.

So I called support, for the 3rd time in one day, and explained the situation. This is the part that finished me off. I said, smugly, "this is not a big deal, because I took a datapump beforehand. It's right there on the desktop. Can you just restore it for me?" And they put me on hold for a bit. Then they came back, and said:

This issue is not something that a datapump restore can fix.

I want you to take a moment to ponder that statement. For eight years, the nightly PowerSchool backup -- the heart and lungs and soul of everything our school district is about -- has been an Oracle datapump, religiously copied off to a server that's over two miles away from the high school, just in case a meteor hits the building with the server room in it. And I had made an EXTRA BACKUP, JUST IN CASE, BEFORE I DID THIS BIG DEAL ANNUAL THING. And now support was telling me, "that's not good enough."

(To be fair, I suppose that if need be, we could've reformatted the hard drive, reinstalled Windows and PowerSchool and Oracle, and THEN imported the clean datapump into the clean environment. Maybe. I'll still never know. It would've taken days, I know that much.)

Anyway, the issue got escalated to Tier 2 -- woohoo! Then...the waiting began. It was now 2:30pm, and my new goal was simply to get things back to where they'd been at 8:30am.

........................

The rest is rather anticlimactic, really. Around 4:00, I saw the mouse start to move, and I watched someone from Tier 2 do some kind of mega-voodoo, and an hour or so after that, PowerSchool and Oracle were happy. I did a few spot checks, ran the guru's "Secret Sauce" to turn off atomic sync, committed the schedule, no errors. Proceeded with steps 19 through 49 without incident. My last note to him was:

IT'S WORKING
COMMIT COMPLETE
You, sir, are a God.

Finished the annual end-of-year process, which normally takes about two hours total, at about 11:30pm, 15 hours after I'd begun.

(Had a nasty scare two days later, when the middle school scheduling person got a system error while trying to reassign any section, or enroll any kids, or do anything at all, really. 5th support call in one week and they had me run another one of those "special operations" which I've NEVER needed to do before; then, they had me run it a second time. Then, everything worked. Even he didn't know why.)

Moral of the story: Shut down before you take a snapshot, and find an African medicine man to fix your database, I guess.

Friday, June 24, 2016
