Getting your Code to Run in Awkward Places

The Program

In my Linux Programming class we were given a web scraping assignment.
The main point of execution is a Bash script, which downloads a webpage and calls a provided Java program to convert the .html to .xhtml.

Next, the Bash script calls a Python program I wrote that traverses the XHTML using xml.dom.minidom and pulls the desired data out of the .xhtml document. The Python program then opens a MySQL connection and inserts the collected data into a MySQL database.
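
As a rough illustration of the traversal step, here is a minimal sketch of pulling table cells out of an XHTML document with xml.dom.minidom. The sample markup, the cell values, and the function names (cell_text, extract_rows) are all hypothetical stand-ins, not the assignment's actual page or my actual parser, and the database insert is left out.

```python
from xml.dom import minidom

# Hypothetical XHTML standing in for the converted page.
SAMPLE_XHTML = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <table>
      <tr><td>first</td><td>1</td></tr>
      <tr><td>second</td><td>2</td></tr>
    </table>
  </body>
</html>"""


def cell_text(cell):
    """Concatenate the text nodes directly inside one element."""
    return "".join(
        node.data for node in cell.childNodes if node.nodeType == node.TEXT_NODE
    ).strip()


def extract_rows(xhtml):
    """Return one tuple of cell strings per <tr> in the document."""
    dom = minidom.parseString(xhtml)
    rows = []
    for tr in dom.getElementsByTagName("tr"):
        cells = tr.getElementsByTagName("td")
        rows.append(tuple(cell_text(td) for td in cells))
    return rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE_XHTML):
        print(row)
```

Each tuple returned by extract_rows would then be a natural unit to hand to a database insert.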

When the Python program finishes, execution returns to Bash. All of this work sits in a Bash function so that I can loop it every minute, collecting the data at a useful frequency while still staying within ICANN regulation.

Why XHTML?

HTML was not very standardized in its early days. This resulted in webpages that do not follow XML's rules, and browser developers had to accommodate these special cases.

XHTML was introduced to enforce XML's rules on HTML, and it provides some degree of portability between browsers.

In the assignment this is an added benefit, as it allows us to safely navigate the webpage using a Python XML library.
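
To make the difference concrete, here is a small sketch showing that an XML parser rejects loosely written HTML but accepts the XHTML equivalent. The two markup strings and the helper name is_well_formed are made up for illustration.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Legal in old-style HTML, but <br> is never closed, so it is not XML.
LOOSE_HTML = "<p>one<br>two</p>"
# The same content with a self-closing tag, as XHTML requires.
STRICT_XHTML = "<p>one<br/>two</p>"


def is_well_formed(markup):
    """Return True if minidom can parse the markup as XML."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True


if __name__ == "__main__":
    print("loose HTML parses:", is_well_formed(LOOSE_HTML))
    print("XHTML parses:", is_well_formed(STRICT_XHTML))
```

This is why the conversion step comes first: without it, the parser could fail on the first unclosed tag.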

About the Environment

We were tasked with using MySQL, HTML, and PHP to display our data. This meant we were required to install and configure a LAMP stack on our machines in order to use Apache to host the HTML and PHP code.

My university provides a distributed system dubbed AFS, which is an unimportant acronym. Each student is given a Linux environment running on a RHEL 7 system. In our Linux environment we are given ~/public_html, which Apache serves as the student's personal subdirectory on the University's domain.

This is a good place to put student profile pages, and it is where I personally practice vanilla web technologies such as plain CSS, vanilla JS, and HTML.

The AFS environment is customized and locked down pretty well. Students have access to Python, JS, Bash, ksh, gcc, g++, and a few other languages. It is not very well known, but each student can also request one MySQL database to work with by filling out a form. Therefore, I can have a full LAMP stack provided to me by the school and can do all my work from AFS.

There is one caveat: we do not have access to any package manager on AFS. This includes pip, dnf, npm, etc.
Because my Python program depends on mysql.connector, a third-party package, this poses a problem if I want to run my code on the University's servers. (xml.dom.minidom is not an issue, since it ships with Python's standard library.)

What did I do?

I have pip3 on my notebook and can easily use mysql.connector in my Python programs after a pip3 install --user mysql.connector.
After a short Google search and a look through the pip documentation I found the -t flag, which specifies a target directory for the Python package to be installed into.

After running pip3 install mysql.connector -t ./ I had the necessary Python modules to rsync up to the host. I used rsync -a because the modules contain several subdirectories, and archive mode copies them recursively while preserving their structure, permissions, and timestamps.
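
If you would rather script the install step than type it, the same pip invocation can be assembled from Python. This is only a sketch: the helper name vendor_command is hypothetical, --target is the long form of -t, and the example prints the command rather than running it so nothing is installed by accident.

```python
import shlex
import sys


def vendor_command(package, target_dir):
    """Build the pip command that installs `package` into `target_dir`."""
    # Using "python -m pip" ties the install to the interpreter running this script.
    return [sys.executable, "-m", "pip", "install", "--target", target_dir, package]


if __name__ == "__main__":
    # Print only; passing the list to subprocess.run would perform the install.
    print(shlex.join(vendor_command("mysql.connector", "./res")))
```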

Forcing Python to use the Modules

After some research I learned that I should not keep the executable files in the same directory as index.php, which serves my webpage, because users could then request them directly by typing their paths into the browser's address bar.
I moved all executable files down into a subdirectory called 'res', then adjusted all of my HTML to point into that subdirectory.

Therefore my project structure looks something like this:

.
├── getHTML.sh
├── index.php
├── parseHTML.py
├── res
│   ├── mysql
│   │   ├── connector
etc etc

There is one last issue. By default Python looks for modules in the directory of the script being run, along with the standard library and site-packages locations, so it wouldn't see the modules in ./res/ without some intervention.
At the top of my Python program, after my imports from the standard library, I added the following lines:

import os
import sys
sys.path.append(os.path.join(os.getcwd(), "res"))
import mysql.connector

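What this does is extend Python's module search path, sys.path. os.path.join builds the absolute path to the 'res' directory from the current working directory in a platform-appropriate way, and that path is appended to the end of the search path. Python does not change directories; when the usual locations do not contain mysql.connector, the import falls through to 'res' and finds it there. Note that this relies on the script being launched from the project directory.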
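
The sketch below shows the same idea end to end with a throwaway package, so it runs without mysql.connector installed. Everything in it is hypothetical (the helper add_vendor_dir, the package name demo_vendored_pkg, the temporary project directory); it only demonstrates that appending to sys.path makes a package importable and leaves the working directory alone.

```python
import importlib
import os
import sys
import tempfile


def add_vendor_dir(base_dir, name="res"):
    """Append base_dir/name to sys.path (once) and return the path added."""
    vendor = os.path.join(os.path.abspath(base_dir), name)
    if vendor not in sys.path:
        sys.path.append(vendor)
    return vendor


def demo():
    """Create a throwaway package under res/ and import it via sys.path."""
    with tempfile.TemporaryDirectory() as project:
        pkg_dir = os.path.join(project, "res", "demo_vendored_pkg")
        os.makedirs(pkg_dir)
        with open(os.path.join(pkg_dir, "__init__.py"), "w") as fh:
            fh.write("ANSWER = 42\n")

        cwd_before = os.getcwd()
        vendor = add_vendor_dir(project)
        importlib.invalidate_caches()  # the directory was created after startup
        module = importlib.import_module("demo_vendored_pkg")
        cwd_unchanged = os.getcwd() == cwd_before

        # Undo the changes so the demo leaves the interpreter as it found it.
        sys.path.remove(vendor)
        sys.modules.pop("demo_vendored_pkg", None)
        return module.ANSWER, cwd_unchanged


if __name__ == "__main__":
    print(demo())
```

In a real script, one option is to call add_vendor_dir(os.path.dirname(os.path.abspath(__file__))) instead of relying on os.getcwd(), so the import keeps working even when the script is started from another directory. That is a suggestion on my part rather than what the original program does.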

Conclusion

You can meet some unique challenges when programming for remote hosts, especially when you are met with restrictions. However, with some minor workarounds you can achieve more than you might expect despite those restrictions.

I've also learned more about how flexible Python can really be. I look forward to working and thinking outside of the box on future projects.