Friday, June 24, 2016

PowerSchool Adventure Time

Wednesday morning, I prepared to run the annual process that promotes students, gives them course schedules, archives grades and attendance, clears fees, etc. It is the biggest, most beastly task we do with the database, and I keep copious notes about every step of the process. These notes change a bit every year, because usually we're forced to update to a new version of PowerSchool sometime during the intervening year, and something always changes. This year, the notes changed a lot.

It began typically. I'd done the validations, fixed the anomalies, exported the historical data, etc. I saved a datapump (full copy of the Oracle db), and I took a VMware snapshot of the whole server. (I haven't done snapshots in a while, and I forgot to shut down the guest first. That'll be significant later.) Two backups are better than one, right?

My notes from the previous two or three years indicate that step 16 of 49 in the end-of-year process is to turn off atomic sync mode. This is some kind of magical incantation that used to be found in the "special operations" page, in a dropdown containing a schizophrenic mix of mundane and horrific commands:



But, in PS version 8.3, "Turn off atomic sync mode" is gone from the list. So, I thought, "well, we must not need that anymore."

Wrong.

So anyway, step 18 -- "commit the high school's schedule" (copy from the staging area to the production area, basically) -- got to the part where it copies the Schedule Section records to the live side, and proceeded to vomit an unending series of:

ATTNENTION! Critical Error generated from MWCP_ParseResponse Description=Milan sync error: DALX-Milan Atomic Sync Failure: DALX-Milan Sync Failure Detected, Response Command=TRANSACTION MsgNum=991
ATTNENTION! Critical Error generated from MWCP_ParseResponse Description=Milan sync error: DALX-Milan Atomic Sync Failure: DALX-Milan Sync Failure Detected, Response Command=TRANSACTION MsgNum=992
ATTNENTION! Critical Error generated from MWCP_ParseResponse Description=Milan sync error: DALX-Milan Atomic Sync Failure: DALX-Milan Sync Failure Detected, Response Command=TRANSACTION MsgNum=993

...and so on. With a sinking heart, I began to read the support pages. Herein follows an excerpt, verbatim:
If the schedule has not committed successfully please attempt the following workaround(s).
A) Commit the Master Schedule a second time to resolve the issue. 
If workaround A is unsuccessful move on to workaround B 
B) Verify that duplicate ScheduleSections (duplicate Course/Section Numbers) and/or duplicate ScheduleCC records are not present. The following Plug-In will assist in finding these duplicates. [link to a user-supplied add-on]
If workaround B is unsuccessful move on to workaround C
C) Using DDA, delete the Section_Meeting records for the School/Year you are attempting to commit. Then re-commit the schedule. Caution: Manual modifications to database records should only be attempted by authorized technical contacts, and can result in the loss of information if not executed properly. Before making any changes via Direct Database Access, you should perform a manual backup of your PowerSchool database. For instructions on how to perform a manual backup, click here.
These errors occur when the commit process attempts to copy invalid data to the live side, such as the items in workaround B. This prevents bad data from be added to the live side. The ability to disable the Atomic Sync was removed in PowerSchool 8.1.1 (Per Release Notes KB 73319). Development made several fixes to committing the schedule in 8.2 as well. Users who are still on version 8.1.1 will need to upgrade.
...wow. I'm on 8.3 by the way.

Then I began to read the customer comments on that KB article, and it got even better.
I just pushed the above plug in and it works great. Why not include this in PS by default rather than make us hunt for it after the fact?  
in section view and teacher schedule view, some sections expressions did not display. I checked the sections to see if the day and period boxes were still checked and they were, so I re-submitted that page, but it still did not make them appear. I then used dda to see if the expressions were actually there, and there were. I then had another user log in to show them the problem, and the expression magically appeared. I went back and logged out and back in again, and the expressions then appeared for me. On my next school that I committed (and by the way we are doing it twice at each school as work around for initial problem) I had the same problem - and same situation where another user could see the expressions and I could not - even after I logged out and back in. I even tried logging in with another browser and they still were not displaying for me, but were for 2 other users. I waited about 30 more minutes and checked and again, they magically appeared. I have no explanation for it - on one of my schools I saw all the expressions immediately myself, but all other schools so far, the expression is not displaying initially on some sections. Later it seems to fix itself.
None of the three workarounds worked for me. Every time I re-committed the schedule, the CC records from the previous attempt to commit were not deleted. This resulted in duplicate CC records. Without the ability to disable atomic sync, I could not delete the records via DDA. The duplicate CC records had to be deleted by Tier 2 tech support at Pearson. Everything is working now, but it took 3 days from the first commit to the eventual resolution of the issue.
After reading all of the above comments. I found the following worked for me.
1. Commit with both options (students and sections).
2. Received errors just mentioned in this KB Article.
3. Ran the commit a 2nd time.
4. Warning that calendar entries would need to be overwritten, but no errors on 2nd schedule commit.
5. Expressions showed up sometimes, other times not so I did a work around found in the knowledge base comments. Go to school setup -> days. Submit with no changes. Expressions start showing up for me if they wasn't before.
6. Enrollment counts do not show up until Calendar Setup is completed then the following special operations.
7. Reset Section Meetings per KB 10247. Rebuild schedules and rosters
8. Re-align section/schedule enrollments. Reset Class Counts
9. School -> Regenerate Bitmaps
Finally everything appears normal.
At this point, my despair turned to panic. The errors were still scrolling from my abortive attempted commit two hours earlier. The rate was approximately 1 error every 2.5 seconds, and it was into the MsgNum=2000 range. I wouldn't be able to try any of the workarounds until it finished. How much longer could it take? I opened a tab into our test server and queried for this year's scheduleCC records. (A CC record represents a link between a student and their enrollment in a section of a course.) We have about 775 students who take a mix of mostly semester-long courses and a few year-long courses, with a courseload of about 7 classes per semester, so the average is something like 12 unique course sections per student per year from the database's perspective.

775 students * 12 classes: 9300 records

9300 records * 2.5 seconds to try to copy, fail, and display an error: 23,250 seconds = about 6.5 hours

That's how long I would have to wait before I could attempt the first of three workarounds. Which might lead to the same result.
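For anyone who wants to redo the back-of-the-envelope math with their own numbers, here is a minimal sketch. The inputs are just the rough figures quoted above (775 students, about 12 sections each, one error every 2.5 seconds); the variable names are made up for illustration and nothing here talks to PowerSchool.

```python
# Rough estimate of how long the failing commit would keep scrolling errors.
# All inputs are the approximate figures from the post, not measured values.
students = 775
sections_per_student = 12   # average unique course sections per student per year
seconds_per_error = 2.5     # observed rate: roughly 1 error every 2.5 seconds

cc_records = students * sections_per_student
total_seconds = cc_records * seconds_per_error
total_hours = total_seconds / 3600

print(f"{cc_records} records, {total_seconds:,.0f} seconds, about {total_hours:.1f} hours")
```

Swap in a different student count or error rate and the wait time scales linearly.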

It was now about noon. Despair, which had become panic, now became rage. I thought of someone who has been a PowerSchool guru to us for several years, whose name I will not put here, since he swore me to secrecy. I expected nothing more than sympathy, but we had the following exchange:
Me: This is completely fucking insane. Do you have any other customers who've run into this? Do you have any suggestions? 
Him: Did I give you the secret instructions for turning off syncing before the commit? 
Me: The special op to turn off atomic sync is gone. Do you know a backdoor? 
Him: Yep! In PowerSchool, Go to System, and in the address location after admin/tech/ type in executecommand.html
You should get a big entry box, in it type:
**DALX_SetGlobalSyncOff
Then Execute
Try the Commit again. Once completed successfully and you’ve run EOY, go back to executecommand.html and:
**DALX_SetGlobalSyncOn
Then Execute
You didn't hear it from me!!!!!! Let me know how it worked! 
Me: OMFG...OK...so now my question is, do I wait for it to finish failing, or do I revert the VMWare snapshot I took right before the shit hit the fan? ...I'm reverting the snapshot. This is going to take many hours otherwise.
(You can't even GET to the executecommand page by clicking links in the web app. You have to type it in. Just for an added bonus.)

OK so, you might recall the part earlier where I mentioned a VMware snapshot, and how I forgot to shut down the guest first. Well, it turns out, that's a bad idea. The server was not cool after I reverted and powered it on. A nagging mental itch began to form in the recesses of what was left of my mind, and I wrote to my close friend and VMware master Jordan:
Me: I took a snapshot with the options to NOT save the virtual memory, and to YES quiesce the filesystem. Turns out I needed to revert to that snapshot, which I did and which succeeded/completed. The guest acts like I pulled its power -- had to power it back on, got the Windows "why the unexpected shutdown" message, Oracle's pissed, etc. What did I do wrong?  
Him: It is exactly like you pulled the power plug if you snap when the machine is running. All the quiescing, etc. reduces the chance of corruption. I usually turn the machine off, snap, and then fire it back up. That way if I go back to the snap, it doesn't get unhappy.  
 oh, fuck.
Me: Ah. I forgot to do that. Oh well, I took a datapump beforehand and I'm on with support now. Fuck ME. Thanks. 
Him: It's bad luck, mostly. I mean, I've hard killed or power cycled computers hundreds of times, and they almost always spring right back up. My nightly backups are all like you did it today: crash consistent. Again, how often does something get corrupted if you power cycle a computer? nearly never. 
So I called support, for the 3rd time in one day, and explained the situation. This is the part that finished me off. I said, smugly, "this is not a big deal, because I took a datapump beforehand. It's right there on the desktop. Can you just restore it for me?" And they put me on hold for a bit. Then they came back, and said:
This issue is not something that a datapump restore can fix.
I want you to take a moment to ponder that statement. For eight years, the nightly PowerSchool backup -- the heart and lungs and soul of everything our school district is about -- has been an Oracle datapump, religiously copied off to a server that's over two miles away from the high school, just in case a meteor hits the building with the server room in it. And I had made an EXTRA BACKUP, JUST IN CASE, BEFORE I DID THIS BIG DEAL ANNUAL THING. And now support was telling me, "that's not good enough."

(To be fair, I suppose that if need be, we could've reformatted the hard drive, reinstalled Windows and PowerSchool and Oracle, and THEN imported the clean datapump into the clean environment. Maybe. I'll still never know. It would've taken days, I know that much.)

Anyway, the issue got escalated to Tier 2 -- woohoo! Then...the waiting began. It was now 2:30pm, and my new goal was simply to get things back to where they'd been at 8:30am.

........................

The rest is rather anticlimactic, really. Around 4:00, I saw the mouse start to move, and I watched someone from Tier 2 do some kind of mega-voodoo, and an hour or so after that, PowerSchool and Oracle were happy. I did a few spot checks, ran the guru's "Secret Sauce" to turn off atomic sync, committed the schedule, no errors. Proceeded with steps 19 through 49 without incident. My last note to him was:
IT'S WORKING
COMMIT COMPLETE
You, sir, are a God.
Finished the annual end of year process, which normally takes about two hours total, at about 11:30pm, 15 hours after I'd begun.

(Had a nasty scare two days later, when the middle school scheduling person got a system error while trying to reassign any section, or enroll any kids, or do anything at all, really. 5th support call in one week and they had me run another one of those "special operations" which I've NEVER needed to do before; then, they had me run it a second time. Then, everything worked. Even he didn't know why.)

Moral of the story: Shut down before you take a snapshot, and find an African medicine man to fix your database, I guess.

Wednesday, November 14, 2012

falling behind

This always happens when I try to keep a blog going. In completely random order:

- Added another print release station, on RMS "Staff" printer. Using G4 iMacs ("lamp" style) for this purpose since they're cute and have a small footprint. Now running 3 release stations, would be better with more licenses for the "real" release station client but the web release works OK.

- Got the Intermapper Xserve going again. Not really sure why it went down in the first place. I went to do diagnostics on the disk, found nothing wrong, rebooted, and it's been fine ever since. Huh? Anyway, we're close to the 50-device limit and the SSL cert is out of date (probably the software too), so it keeps disconnecting while I'm working on the map. I've also forgotten a lot of the little tricks Fisher showed me, but I have notes somewhere.

- Added two drives to the RMS Xserve RAID, expanding the 2nd RAID set to its maximum of 1.5TB. I think. Not sure if I have to reformat/repartition, which would be a pain. I'll give it overnight to finish thinking about itself or whatever.

- Scheduled AlertNow/Connect5 training for everyone@everywhere.com in December. I bet at least 1/3 will not make it.

- Helped RMS populate PS fields properly for new NH state reports. It's ugly, but workable.

- Attended several of MCS's after-school tech trainings.

- Attended Apple's fall tech update in Portsmouth. We're going to need a Bonjour gateway if we go full-bore with Apple TVs.

- Showed Carol at SAU how to do the permissions stuff I do in the NH DoE SSO system.

- Figured out how to enable duplex via web print (create a 2nd printer queue).

- Helped transition control of the Broadside website to a student (Conrad)

- Worked with a network consultant to figure out what it would take to not have to do two-headed DNS. Answer: Complicated. Need another interface on the router.

- Purchased gear for Ray network mini-upgrade. 10 new Airport Extreme base stations, 3 new GB core switches, 6 new GB desktop switches, a bunch of new Cat6 cabling. Planning to do installation today at 2pm with Steve. Ordered little Ikea wall shelves for the new APs but those won't get installed til after Thanksgiving due to shipping time.

- Helped Andrea at HHS with PowerSchool fees reporting stuff (can't do a custom SQL report against a field called "date", thanks PS!) Also helped with reporting/exporting on discipline log entries. Need to document this...special functions, search log entries, then export and you get a mini DDE against the log entries table.

- Upgraded PaperCut at both HHS and RMS.

- Redid the DNS setup at RMS. Separate notes with more detail but basically, the rmsmini is the DNS master, and nameserver slaves to it. They both forward to ns1 and ns2, and both resolve using themselves first. In the event of a fiber bypass, change both sets of forwarders to the Comcast IP addresses. ns1.dresden.us slaves the frms zone from the mini.

- In the process of transferring domains to NHVT.net. Eventually I plan to have them do all our external DNS; switch internal dresden.us SOA to the high school's Mac nameserver; switch internal sau70.org SOA to the SAU mini; replicate the RMS config at both MCS and Ray; and, hopefully, vastly simplify DNS management for my successor(s).

- Rebuilt felix after its boot drive failed; set up a CCC job to a second disk, as we're doing with PaperCut. Screw RAID.

- Rebuilt homer after one of its data drives failed and corrupted the other member of the RAID set; set up a CCC job to a second disk, as we're doing with PaperCut and felix. Screw RAID all the way. CrashPlan restore worked perfectly, and just in case of data loss from the day it went down, I did a full restore using Data Rescue II; but no students came forward complaining of lost info, so that was unnecessary. Nice to know that we technically lost zero data though.

- Fixed (?) the perpetually-jammed print queues at RMS with a script that unjams them. Makes no sense but seems to work.

- Put the January 2012 Chromium image on the library Dell netbooks for use by students to do basic Internet research and Google Docs. No printing, no cute stuff, hopefully easier to maintain for these limited tasks.

- Opened a case with Aruba regarding the connectivity problems on the RMS 6th grade floor. But Sam and I moved an AP up there and haven't heard of any more total cart failures since we did that, so, that might've been the fix. Closed the case with Aruba until or unless issues arise again. I also wonder if the DNS rearrangement on the servers helped stabilize things (maybe the APs were having trouble resolving the controller?) OR maybe now that Sam isn't doing much netbooting/imaging, the mini is less overloaded. We may never know. I've readied a 2nd mini server to take over DNS/DHCP if things get weird again.

- Figured out some reports for Deb at Ray re: attendance comments to help her with her ILI stats and such. The Custom Reports Bundle rocks.

- Put together preliminary budget numbers for the district-wide tech accounts. Added them up and they came within $170 of the current budget. woot.

- Processed several Apple quotes and orders for Marion Cross since Apple forgot how to not charge sales tax in VT.

- Coordinated two major recycling runs with Computer Recycling of Claremont. Probably close to a ton across all four schools.

- Put the "ipad student 1" etc. accounts in a local-email-only suborg to limit the potential for abuse via quasi-anonymous email.

- Attended a webinar on Hapara, an add-on to Google Apps that can give teachers "god rights" over students' data, makes it easy to organize shared docs across a class, etc. MCS might sign up for a section or two and see how it goes.

- Launched a Chromebook evaluation experiment with an RMS 8th grader to get some real-world feedback about its viability as a student device in a 1-1 scenario. Biggest potential pitfall I see thus far is printing. Lots of notes in the tech docs website about that.

- Patched PS gradebook server config to not break with new version of Java

- Met with nurses to discuss PS capabilities and limitations re: screenings, immunizations, and office visits. Got some stuff sorted out, other stuff will take a ReportWorks weenie with more skills than me. Updated the Big Enroll Form again to include a few more tweaks for their convenience.

- Redid the March Intensive signup system for 2013. Need some of the freshman web nerds to port that sucker to something more modern.

- Began process of porting HHS website to Google Sites. Still working out how best to handle the DNS/redirection.

- Set up Tyler Tech remote access to BudgetSense server via persistent Bomgar thingy so we don't have to punch a hole in the firewall anymore.

- Helped John and MCS get going with Mealtime for their lunch program. Mainly the PS export portion, John did the rest.

- Restored Ray and MCS connectivity due to massive power surge that killed several of our power strips/surge suppressors.

- Prepped network and machine guest access for RMS tech night.

- Did MCS tech faire with John & crew.

- Computed shared cost stuff for PS, Internet connection, Microsoft Office, etc.

- Worked with John L and a talented freshman to arrange some projector/laptop theatrics for a play

- fixed many broken things

Enough for now.

Sunday, September 30, 2012

Friday

It's now Sunday, so I can't remember all of what I did Friday. I spent almost the whole day at RMS triaging a record number of tickets -- mostly little things, mostly unrelated -- the usual, "I can't print", "I can't save", "wifi slow" (that one turned out to be on YWP's end.) We were all running like scalded dogs from 7:45am til 3pm. The print queues keep stopping, and I can't figure out why, but I'll bet it's related to the redirected Desktop/Documents and Lion problems we've been having all year. Also there are still people turning up with personal laptops that have not been brought to us for PaperCut prep, so as I continue to firewall printers, they discover they can't print anymore, and that's how we discover them. Duh. Wrote a snotty note to send out, but at this point I'll probably just wait for them to self-identify after they lose the ability to print. There can't be many left.

My favorite was a cart laptop that wouldn't boot. 2009 MacBook. Tried netbooting, no dice. We had one just like it with no sound, so I grabbed its hard drive and installed it and still couldn't netboot. No FireWire or Thunderbolt so no way to transfer the cart image. So, took the damned drive back out and used a USB/SATA adapter to connect it directly to the DeployStudio repo server and tried Disk Utility Restore; it crashed. Tried again on the Mini (so as not to be hammering the hard drive during the school day when two hundred kids are connected) and partway through the restore, the disk self-ejected. I became convinced it was my USB adapter, so drove to the high school to get another one that I'd just used successfully the day before. Also grabbed a brand-new hard drive, just in case. So of course it turned out that I'd replaced one bad hard drive with another one that was also bad; new drive + image = all better. Only took six hours.

Spent the evening chaperoning the RMS dance. Not really. I sat in the lobby trying to turn a Dell 2110 netbook into a Chromebook. Spent most of Saturday and half of Sunday on same, but it's not going well. There's one image from January that works, but it's so old, things like offline Docs/Drive don't work. Not sure what else. I guess we could still use them in school but I worry that they'll be glitchy.

Thursday

failed to actually post on the day of.

- worked with Marty to get the Mealtime PC online and virus-protected -- this so that it can talk to the Mealtime motherbrain and sync parents' lunch account deposits daily

- found source of BOY report errors for Ray and RMS (unset FTEs for new students = incorrect attendance calculations, and garbage data in a state field gets pulled into the report too)

- placed order for free Adobe Digital Collection upgrade for RMS -- I guess we bought maintenance after all. Went ahead and asked for the physical media since it's free.

- evaluated an open-source font to help dyslexic people -- we could copy to all comps via ARD if we wanted to, and there's a Chrome extension available which I haven't tried yet.

- responded to this week's BudgetSense bug -- email settings are somehow fuxxored in Norwich, I am hopeful that their tech Sean will chime in.

- conferred with Jordan again re: new switches for HHS -- going to exchange notes on the quotes we get etc.

- swapped out a teacher MacBook at RMS due to sound issues

- repaired an HHS student's laptop and recovered homework file

- received part for MacBook screen replacement

- more PO paperwork

- evaluated two replaced hard drives to identify one that still works (turns out, they're both screwed)

- attended Minelli's QRCodes training. Turns out my method for embedding audio into a Google Site is totally broken. Dammit! No satisfactory alternate found so far that doesn't involve hacking the HTML directly.

Wednesday, September 26, 2012

Wednesday

- enabled "games" module in Moodle; allows teacher to, e.g., make a hangman game using words from a course glossary.

- tuned energy saver settings for RMS labs -- automatic startup at 7am and shutdown at 4pm on weekdays. Had this last year but needed to restore settings since we reimaged and reconfigured everything.

- attended K12/curriculum meeting with somewhat vague agenda. Discussed MCS tech training day on Oct 5.

- Started the ball rolling on getting the Chrome device management console enabled for our domain. This will let us see what the options would be like for managing carts etc. of Chromebooks. Also found info about how to turn older netbooks into Chromebooks via Chromium (the open-source project underlying Chrome).

- Archived PowerSchool backups. Keeping dumpfiles from 1st of every month going back to 2009 (ish) and keeping every day's dumpfile for past month. Need to make a more regular habit of this.

- Updated various Mac servers, cleared stuck print jobs.

- Granted i4see system privileges to Sarah Curtis for Title IIA grants

- Made appointment for October webinar/confcall with SHI's ProCurve specialist to begin speccing out new switches for HHS.

- Helped develop a streamlined workflow for running software updates en masse via ARD

- Established that Friday's DNS weirdness fixed itself; probably related to recent widespread ISP troubles with google.com, as confirmed by colleague Jordan DesRoches

- Filed accounting paperwork for Aruba support agreement for 12-13

- Tried out new Kurzweil-esque Chrome add-on for text-to-speech and other learning tools in Google Apps -- forwarded to techs and sped director

- Did POs for reimbursement for various things (network port faceplates, USB extenders, USB thumbdrives)

- Made donations/payments to several software companies whose products we use heavily (DeployStudio, Apache, GAM, DiskWarrior, Carbon Copy Cloner, WinClone)
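
The PowerSchool backup archiving rule from the list above (keep the dumpfile from the 1st of every month, plus every day's dumpfile for the past month) is simple enough to sketch in code, which might help with making it a more regular habit. This is only a sketch under assumptions: "past month" is taken to mean 31 days, the function names are hypothetical, and how a date gets extracted from each dumpfile's name is left out because the naming scheme isn't described here.

```python
from datetime import date, timedelta

def should_keep(dump_date: date, today: date, daily_window_days: int = 31) -> bool:
    """Return True if a nightly dumpfile dated dump_date should be kept.

    Assumed rule: keep anything dated the 1st of a month indefinitely,
    and keep every dump from the last daily_window_days days.
    """
    if dump_date.day == 1:
        return True
    return today - dump_date <= timedelta(days=daily_window_days)

def split_dumps(dump_dates, today):
    """Split dump dates into (keep, prune) lists, each sorted oldest first."""
    keep = sorted(d for d in dump_dates if should_keep(d, today))
    prune = sorted(d for d in dump_dates if not should_keep(d, today))
    return keep, prune

# Example with made-up dump dates, evaluated as of the date of this post.
today = date(2012, 9, 26)
dumps = [
    date(2012, 9, 25),  # yesterday: inside the daily window
    date(2012, 9, 1),   # first of the month
    date(2012, 8, 1),   # first of the month
    date(2012, 8, 15),  # older than a month, not a 1st
    date(2012, 7, 4),   # older than a month, not a 1st
]
keep, prune = split_dumps(dumps, today)
print("keep: ", [d.isoformat() for d in keep])
print("prune:", [d.isoformat() for d in prune])
```

The script only decides what to keep; actually deleting or moving files is deliberately left to a human until the list looks right.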