Qemu Environments Not Booting Correctly

Deployment: Local installation (in local data center)
EaaSI Version: v2021.10
Browser: N/A
Description:

Any environment based on Qemu that I attempt to run fails to boot. The same boot screen loads and then nothing else happens. I’ve attached a screenshot.

This issue first occurred when I tried following the EaaSI user guide to install a FreeDOS 1.2 resource. However, it also happens with both Windows environments I’ve tested. I think the issue may be related to me manually importing qemu before saving the environments to my local node.

The environments that do not boot are:

  1. Windows 95 Test Env
  2. Windows XP Professional 2002 SP3 32 Bit + Microsoft Office Home and Student 6994

Are you able to reproduce the issue or did it happen once? What steps can you take to repeat the issue? What did you expect to occur and what was the actual outcome?

This happens whenever I try to boot either environment. I know the Windows 95 Test Env was previously working on my earlier installation.

Urgency: If possible, please give an indication of how urgently the issue needs to be addressed - is there a timeline or deadline (e.g. upcoming demo, researcher request, etc.) that EaaSI support staff should be aware of?

I’m working on a summer research project with EaaSI and trying to get it functioning for student evaluative use. I can try a fresh installation again if needed.

Hey @ekaltman - the process of saving environments to your node should automatically copy over any necessary QEMU container, if that Public environment is configured to run a particular emulator/version that you don’t yet have in your node. And it seems like QEMU is running here. (We’ve seen issues with the emulators not copying over properly in the past; in such cases, when you try to run the copied environment, you get nothing in the Access interface at all and probably an error thrown by the UI.) It just can’t find the bootable device (system drive), so something may have gone wrong in copying over the underlying QCOW images of the environment.

We could confirm that by taking QEMU/the emulators off the table - does a SheepShaver or Basilisk-based Mac environment work? E.g. “Apple Mac OS 7.5”?

Either way - could you also try saving a different PC/QEMU-based environment to your node, then immediately download a server log from Manage Node → Troubleshooting → Download Server Log and provide it here?

I downloaded “Mac OS 7.5 + Ready Set GO 4.5a” and it failed to load.

When I try to download the server.log file through the Web UI nothing happens, it goes to the error reporting URL and hangs. It does correctly return the front-end logs however.

The server.log below shows the error reporting error:
eaasi_error_reporting_issue_20220609.log (1.8 MB)

There also appear to be some local access errors in there. Some of the /eaasi/ subfolders appear to be owned by root instead of the eaasi user?

I’m attaching the current server.log file as well. I initiated a save operation for “Windows 98 SE + Microsoft Works 8” around two hours ago that has still not finished. I can send another update early tomorrow once I’m back in the office.

Current eaasi server.log:
eaasi_error_qemu_import_20220609.log (1.9 MB)

@ekaltman, according to the provided log, all of your attempts to import and replicate environments were successful. Your specific “Windows 98” image is about 20GB in size, so depending on your internet connection it might take longer to import. Please also make sure that you have enough storage space available.

But starting emulation sessions with imported images does indeed fail. The following output might contain more information:
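For the storage check, a minimal sketch along these lines may help. The function name `has_free_space` is made up for illustration, and the assumption is that images land somewhere under /eaasi on the same filesystem.

```shell
# has_free_space DIR MIN_GB  (hypothetical helper, not part of EaaSI)
# Succeeds if the filesystem holding DIR has at least MIN_GB gigabytes available.
has_free_space() {
    check_dir=$1
    min_gb=$2
    # df -Pk prints one POSIX-format line per filesystem in 1024-byte blocks;
    # the fourth column of the second line is the available space.
    avail_kb=$(df -Pk "$check_dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}
```

Something like `has_free_space /eaasi 20 && echo "enough room"` would then cover an image of roughly the size mentioned above.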

$ sudo docker logs nginx > nginx.log

Generating error-reports fails because of file permission problems, which should not happen. Could you please post the output of the following command here too:

$ sudo tree -pugh /eaasi > eaasi-tree.txt

Updated log after process completed:
eaasi_error_qemu_import_complete_20220609.log (1.9 MB)

Nginx Log and Tree Output
eaasi_nginx_20220609.log (883.7 KB)
eaasi_tree_20220609.txt (430.9 KB)

My connection is from the data center so it should be rather fast. The VM I’m running EaaSI on has ~300GB of space available.

@oooleg it appears that some of the tree is not owned by the eaasi user. I’m going to try changing the ownership to eaasi to see if that fixes anything. Let me know if the logs show any other potential issues.

@ekaltman, what is the output of the following command executed on your server:

$ sudo id eaasi ekaltman

It looks like some of the data was created under different users. We normally do not change filesystem ownership during installation. But it might also be caused by a mismatch of host vs. container users, under which the backend is run inside of the containers.

The exact command returns an error. Run individually, however, the two lookups give:

ekaltman@eaasi:~$ sudo id eaasi
uid=1002(eaasi) gid=1002(eaasi) groups=1002(eaasi),27(sudo),108(lxd),999(docker)
ekaltman@eaasi:~$ sudo id ekaltman
uid=1000(ekaltman) gid=1000(ekaltman) groups=1000(ekaltman),4(adm),24(cdrom),27(sudo),30(dip),46(plugdev),108(lxd)

OK, thanks. This looks like a user-id mismatch. Backend is always running under user 1000 in the container, which maps to ekaltman user on the host. Then, processes inside of the container running under user 1000 fail to read some files owned by 1002 (eaasi) user on the host. That is the reason why generating error-reports fails.

If possible, could you try to install EaaSI as ekaltman user? It should be enough to run:

$ sudo systemctl stop eaas
$ sudo chown -R ekaltman:ekaltman /eaasi
$ sudo systemctl start eaas
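To confirm that nothing under the data directory is still owned by another user after a change like this, a sketch such as the following could be used. The helper name `list_not_owned_by` is hypothetical, and it relies on the `-uid` test available in GNU and BSD find.

```shell
# list_not_owned_by DIR NUMERIC_UID  (hypothetical helper)
# Prints every path under DIR whose owner is not the given numeric user id.
list_not_owned_by() {
    find "$1" ! -uid "$2" -print
}
```

Run as root, `list_not_owned_by /eaasi 1000` should print nothing once every file belongs to the user the backend runs as (uid 1000 in the output above).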

Currently, replication of environments is automatically aborted after a timeout of 1 hour. According to the log, importing “Windows 98” images seems to take longer on your server and you are running into that timeout there.

I have increased the timeout for you to try. Just change your eaasi.yaml to:

eaas:
  version: "v2021.10.1-eaasi"

and update your server:

$ ./scripts/update.sh ear

I reinstalled and changed the version to enable the new timeout. I also changed ownership of the /eaasi folder to the “ekaltman” user.

The server log download now works, though I’m not sure which key I need to use to decrypt the gpg file.

I attempted to download the Windows 95 Test Env and it now failed rather quickly. Server log is attached below.
eaasi_environment_download_error_20220615.log (715.1 KB)

The error with QEMU not booting the environment is the same, I’m thinking that the node is not downloading the environments correctly.

Any environment I attempt to download claims to have finished and shows “Saved Locally”, but still has “Run in Emulator” faded. When I go into details I can try to run the emulation, but for Qemu it just boots to a “no bootable device” screen. Mac environments immediately fail before boot, which seems similar to the CMU issue.

There is definitely some issue with downloading the environments correctly, don’t know if that is on my end or the eaasi system end however. @ethan.gates I will probably try just installing and configuring environments locally, I’ll look out for office hours.

At this point, I don’t really know what else to try, if we want to schedule some live debugging let me know.

I ran a tail on the server.log while trying to initialize the Windows 98 run-time. It appears to not be finding the correct data.
eaasi_environment_download_error_20220616.log (40.6 KB)

The Windows 98 image is downloaded to the server, as are the other images, so I’m not sure what the issue is. I’m doing a fresh reinstallation on a fresh VM to see if there are other issues.

According to your log, downloading the Windows 95 environment was successful:

2022-06-15 16:22:23.175 |I| (EE-ManagedExecutorService-io-Thread-4) [IMAGE-ARCHIVE] (service:imports) Executing import-task 3 took 102 second(s)
2022-06-15 16:22:23.212 |I| (EE-ManagedExecutorService-io-Thread-4) [IMAGE-ARCHIVE] (service:imports) Executing import-task 2 took 111 second(s)
2022-06-15 16:22:23.251 |I| (EE-ManagedExecutorService-io-Thread-4) [IMAGE-ARCHIVE] (service:imports) Executing import-task 1 took 119 second(s)
2022-06-15 16:22:23.280 |I| (EE-ManagedExecutorService-io-Thread-3) [IMAGE-ARCHIVE-CLIENT] Replicated environment '56696004-84ff-4d5f-8596-407caa07c030'
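Lines like these can be pulled out of a server.log to see how long each import took, for example to compare against the one-hour replication timeout mentioned earlier in the thread. This is only a sketch: the function names are made up, and it assumes the message format stays exactly as shown above.

```shell
# import_task_durations LOGFILE  (hypothetical helper)
# Prints "task-id seconds" for each completed import task found in the log.
import_task_durations() {
    # In basic regular expressions the parentheses in "second(s)" are literal,
    # while \( \) mark the two captured numbers.
    sed -n 's/.*Executing import-task \([0-9][0-9]*\) took \([0-9][0-9]*\) second(s).*/\1 \2/p' "$1"
}

# longest_import_seconds LOGFILE  (hypothetical helper)
# Prints the largest duration in seconds, or 0 if no import task was logged.
longest_import_seconds() {
    import_task_durations "$1" | awk '$2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}
```

A value from `longest_import_seconds server.log` that approaches 3600 would suggest the timeout is the limiting factor.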

Hi @oooleg, yes, the environments were downloading but they were loading into a “no bootable device” qemu screen. I tried to wipe the server and reinstall everything again but am now encountering a new issue. The setup fails again at:

TASK [wait for eaas-server to start up] ***************************************************************************************************************************************************************************************************
fatal: [eaas-gateway]: FAILED! => changed=false
  attempts: 1
  content: ''
  elapsed: 0
  msg: 'Status code was -1 and not [200]: Request failed: <urlopen error [Errno -2] Name or service not known>'
  redirected: false
  status: -1
  url: https://eaasi.csuci.edu/emil/environment-repository/actions/prepare

However, last time the EaaSI system was still reachable. It appears that now the nginx docker container (not eaasi-nginx) is failing to start. From docker ps:

ekaltman@eaasi:/$ docker ps
CONTAINER ID   IMAGE                                                                 COMMAND                  CREATED         STATUS                          PORTS                                       NAMES
321719e6fd07   nginx:stable                                                          "/docker-entrypoint.…"   5 minutes ago   Restarting (1) 26 seconds ago                                               eaasi-nginx
da9ff1d72056   eaas/eaas-appserver:v2021.10-eaasi                                    "/init"                  5 minutes ago   Up 5 minutes                                                                eaas
0fab64605bac   registry.gitlab.com/eaasi/eaasi-client-pub/eaasi-web-api:v2021.10     "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes                    0.0.0.0:8081->8081/tcp, :::8081->8081/tcp   eaasi-web-api
5dfa2a6059b7   jboss/keycloak:15.0.2                                                 "/opt/jboss/tools/do…"   5 minutes ago   Up 5 minutes                    8080/tcp, 8443/tcp                          keycloak
1c29c3068b13   registry.gitlab.com/eaasi/eaasi-client-pub/eaasi-front-end:v2021.10   "/docker-entrypoint.…"   5 minutes ago   Up 5 minutes                    0.0.0.0:8080->80/tcp, :::8080->80/tcp       eaasi-front-end
0fe466dd0a0c   registry.gitlab.com/eaasi/eaasi-client-pub/eaasi-database:v2021.10    "docker-entrypoint.s…"   5 minutes ago   Up 5 minutes                    0.0.0.0:5432->5432/tcp, :::5432->5432/tcp   eaasi-database
638e6d1c543c   minio/minio:RELEASE.2021-11-03T03-36-36Z                              "/usr/bin/docker-ent…"   5 minutes ago   Up 5 minutes                    9000/tcp                                    minio
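With this many containers, a restart loop is easy to miss in the table. A small filter like the one below can pick such containers out. It is a sketch with a hypothetical function name, and it assumes its input is produced with a `NAME=STATUS` format string as shown in the comment.

```shell
# restarting_containers  (hypothetical helper)
# Reads "NAME=STATUS" lines on stdin and prints the names of containers
# whose status starts with "Restarting". Example input source:
#   docker ps -a --format '{{.Names}}={{.Status}}'
restarting_containers() {
    awk -F '=' '$2 ~ /^Restarting/ { print $1 }'
}
```

Docker's own `docker ps --filter status=restarting` gives a similar result without any helper.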

I can start a new ticket for this error if you like. docker logs nginx produces:

2022/06/17 19:03:09 [emerg] 1#1: unknown directive "js_include" in /etc/nginx/nginx.conf:20
nginx: [emerg] unknown directive "js_include" in /etc/nginx/nginx.conf:20
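As far as I know, the `js_include` directive was dropped from newer releases of the nginx njs module in favor of `js_import`, which would explain why a freshly pulled nginx:stable image rejects an older config. To see where a config file still uses it, a sketch like this could help. The function name is made up, and the location of the config file on the host is not known from this thread, so the path has to be supplied.

```shell
# find_js_include CONF  (hypothetical helper)
# Prints "line-number:text" for every js_include directive in CONF.
# Returns non-zero if the directive does not occur.
find_js_include() {
    grep -nE '^[[:space:]]*js_include[[:space:]]' "$1"
}
```

Usage would be `find_js_include /path/to/nginx.conf`, where the path is a placeholder for wherever the installer keeps the file.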

@ekaltman, you have probably removed all containers recently. The new nginx container no longer seems to be compatible with the config from EaaSI.

As a workaround, you should now update the installer as follows:

$ cd <eaasi-installer>/eaas/ansible
$ git fetch origin
$ git checkout origin master

Then run the installer again, ignoring the last startup error.

Actually, images seem to be correctly replicated but are then not found when setting up an emulation session. I’m not yet sure what is the root cause for this in your case, especially if the server is constantly tweaked and modified :slight_smile:

Is there any chance to get ssh-access to your machine? Or, alternatively, we could schedule a zoom call and share screen? That might be simpler to find the root cause for your issues.

Okay, that fixed the nginx issue and I appear to have access to the interface again. Just a note that the initial pull did not work:

➜  eaasi-installer git:(a2d2e9e) cd eaas/ansible
➜  ansible git:(60885d4) git fetch origin
➜  ansible git:(60885d4) git checkout origin master
error: pathspec 'master' did not match any file(s) known to git

So I just ignored the origin for both commands and the pull worked.

yeah, sorry… it should be:

$ git checkout origin/master

But leaving out the remote prefix should also work in this case.

@oooleg the ssh access is currently limited to our VPN, I could open up the ssh access publicly but I think a screenshare might be the most effective. I’m free for most of this afternoon until 4pm PT and again on Monday or Tuesday for middle of the day times (10am-3pm PT).

And just an edit to say thank you for working through this with me!