The Program
In my Linux Programming class we were given a web scraping assignment.
It consisted of writing a Bash script as the main point of execution. The Bash script downloads a webpage and calls a provided Java program to convert the .html to .xhtml.
Next, the Bash script calls a Python program I wrote that traverses the XHTML using xml.dom.minidom and pulls the desired data out of the .xhtml document. The Python program then opens a MySQL connection and inserts the collected data into a MySQL database.
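As a rough illustration of that parse-then-insert step, here is a minimal sketch. The sample markup, the `quotes` table, and its `symbol` and `price` columns are all made up for the example, since the real page and schema are not shown here. `insert_rows` accepts any DB-API connection, for instance one returned by `mysql.connector.connect()`.

```python
from xml.dom import minidom

# Hypothetical XHTML fragment; the real page layout is not shown in this post.
SAMPLE_XHTML = """<html><body><table>
<tr><td>ACME</td><td>12.50</td></tr>
<tr><td>INIT</td><td>7.25</td></tr>
</table></body></html>"""


def cell_text(cell):
    """Concatenate the text nodes sitting directly inside one <td> element."""
    return "".join(
        node.data for node in cell.childNodes if node.nodeType == node.TEXT_NODE
    ).strip()


def extract_rows(xhtml):
    """Return one tuple of cell strings per <tr> found in the document."""
    doc = minidom.parseString(xhtml)
    rows = []
    for tr in doc.getElementsByTagName("tr"):
        cells = tr.getElementsByTagName("td")
        rows.append(tuple(cell_text(td) for td in cells))
    return rows


def insert_rows(conn, rows):
    """Insert the rows through a DB-API connection and commit."""
    cursor = conn.cursor()
    # Hypothetical table and column names; %s is the placeholder style
    # that mysql.connector expects.
    cursor.executemany("INSERT INTO quotes (symbol, price) VALUES (%s, %s)", rows)
    conn.commit()
    cursor.close()


rows = extract_rows(SAMPLE_XHTML)
print(rows)
```

Passing the connection in, rather than creating it inside the function, keeps the parsing code usable on machines where the MySQL driver is not installed.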
When the Python program finishes, execution returns to Bash. All of this work sits in a Bash function so that I can loop it every minute, collecting the data at a useful frequency while still staying within ICANN regulation.
Why xHTML?
HTML was not very standardized in its early days. As a result, some webpages broke the XML guidelines, and browser developers had to accommodate these special cases.
xHTML was then introduced, which enforces the XML guidelines and provides some form of portability between browsers.
In the assignment this is an added benefit, as it allows us to safely navigate the webpage using a Python XML library.
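To make the difference concrete, here is a small sketch (the two markup strings are invented for the example) showing that xml.dom.minidom rejects HTML-style markup with an unclosed tag but accepts the XHTML equivalent:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def is_well_formed(markup):
    """Return True if minidom can parse the markup as XML."""
    try:
        minidom.parseString(markup)
    except ExpatError:
        return False
    return True


# HTML tolerates an unclosed <br>; XML does not.
html_style = "<p>first line<br>second line</p>"
# XHTML closes every element, so an XML parser can read it.
xhtml_style = "<p>first line<br/>second line</p>"

print(is_well_formed(html_style))   # False
print(is_well_formed(xhtml_style))  # True
```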
About the Environment
We were tasked with using MySQL, HTML, and PHP to display our data. This meant we were required to install and configure a LAMP stack on our machines in order to use Apache to host the HTML and PHP code.
My university provides a distributed system dubbed AFS, which is an unimportant acronym. Each student is given a Linux environment running on a RHEL 7 system. In our Linux environment we are given ~/public_html, which is served by Apache as the student's personal subdirectory on the University's domain.
This is a good place to put student profile pages, and it is where I personally practice vanilla web technologies such as plain CSS, vanilla JS, and HTML.
The AFS environment is customized and locked down pretty well. Students have access to Python, JS, Bash, ksh, gcc, g++, and a few other languages. It is not very well known, but each student can also request one MySQL database to work with by filling out a form. Therefore, I can have a full LAMP stack provided to me by the school and can do all my work from AFS.
There is one caveat: we do not have access to any package manager on AFS. This includes pip, dnf, npm, etc.
My Python program depends on xml.dom.minidom and mysql.connector. The former ships with Python's standard library, but mysql.connector does not, so this poses a problem if I want to run my code on the University's servers.
What did I do?
I have pip3 on my notebook and can easily use these modules in my Python programs after a pip3 install --user mysql.connector xml.dom.minidom.
After a short Google search and a look through the pip documentation, I found the -t flag, which specifies a target directory for the Python package to be installed to.
After running pip3 install mysql.connector -t ./ I had the necessary Python modules to rsync up to the host. I used rsync -a because the modules contain several subdirectories, and archive mode preserves the directories and their structure.
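The two steps can also be scripted. The sketch below only builds and prints the commands rather than running them, since running them needs network access. The `./modules` directory and the `student@example.edu` host are placeholders, not my real setup. `--target` is the long form of pip's `-t` flag.

```python
import shlex
import sys


def pip_target_command(package, target_dir):
    """Build a pip command that installs a package into target_dir."""
    return [sys.executable, "-m", "pip", "install", "--target", target_dir, package]


def rsync_command(local_dir, remote):
    """Build an rsync command that copies the contents of local_dir in archive mode."""
    # The trailing slash tells rsync to copy the directory's contents
    # rather than the directory itself.
    return ["rsync", "-a", local_dir.rstrip("/") + "/", remote]


# Hypothetical directory name and remote host, for illustration only.
pip_cmd = pip_target_command("mysql.connector", "./modules")
sync_cmd = rsync_command("./modules", "student@example.edu:public_html/modules/")

print(shlex.join(pip_cmd))
print(shlex.join(sync_cmd))
```

Passing either list to `subprocess.run(cmd, check=True)` would execute it.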
Forcing Python to use the Modules
After some research I learned that I should not keep the executable files in the same directory as index.php, which serves my webpage, because users could then reach them by typing their paths into the browser's address bar.
I moved all executable files down into a subdirectory called 'res', then adjusted all of my HTML to point into that subdirectory.
Therefore my project structure looks something like this:
.
├── getHTML.sh
├── index.php
├── parseHTML.py
├── res
│ ├── mysql
│ │ ├── connector
etc etc
There is one last issue. By default, Python looks for imported modules in the directory containing the script and in its standard library and site-packages locations, so it wouldn't see the modules in ./res/ without some intervention.
At the top of my Python script, after my imports from the standard library, I added the following lines:
sys.path.append(os.path.join(os.getcwd(), "res"))
import mysql.connector
This appends the res directory to sys.path, the list of directories Python searches when it resolves an import. os.path.join builds that path by joining the absolute path of the current working directory with 'res'. Python does not actually cd anywhere; it simply checks res as well when it looks for mysql.connector. Because the path is built from the working directory, this only works when the script is started from the project directory.
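Here is a small self-contained sketch of the same idea. The `add_vendor_dir` helper and the `vendored_demo` module are made up for the demo. It creates a throwaway 'res' directory, drops a module into it, and shows that the module becomes importable once the directory is on sys.path.

```python
import importlib
import os
import sys
import tempfile


def add_vendor_dir(base_dir, name="res"):
    """Append base_dir/name to sys.path (only once) and return its absolute path."""
    vendor = os.path.abspath(os.path.join(base_dir, name))
    if vendor not in sys.path:
        sys.path.append(vendor)
    return vendor


# Build a throwaway project directory containing res/vendored_demo.py.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "res"))
with open(os.path.join(base, "res", "vendored_demo.py"), "w") as fh:
    fh.write("GREETING = 'hello from res'\n")

vendor_path = add_vendor_dir(base)
importlib.invalidate_caches()

# The import now succeeds because res is on the module search path.
demo = importlib.import_module("vendored_demo")
print(demo.GREETING)
```

In a real script, passing `os.path.dirname(os.path.abspath(__file__))` as `base_dir` instead of relying on the working directory would keep the import working no matter where the script is launched from.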
Conclusion
You can meet some unique challenges when programming for remote hosts, especially when you are faced with restrictions. However, with some minor workarounds you can achieve more than you might expect.
I've also learned more about how flexible Python can really be, and I look forward to working and thinking outside of the box on future projects.