System Overlord

A blog about security engineering, research, and general hacking.

Lying to Google (a.k.a. SEO)

Search Engine Optimization (SEO) comes in two basic forms.  The first really is optimization: ensuring that your site has good links, that the content is relevant, and that the site adheres to good structural practices.  With the ever-growing complexity of websites, taking steps to help search engines understand your content and the structure of your site makes good sense.  With the new notion of a "semantic web", this will grow to a new level and become a key part of web development best practices.

The second form of search engine optimization basically amounts to lying to Google.  I say "Google" and not "search engines" because Google's market share means that most SEO is an effort to get the highest possible ranking on Google.  Take, for example, the practice referred to as "Google bombing".  Creating many misleading links to a page so that it appears for keywords that have nothing to do with its content deceives both Google and consumers.

A few days ago, Matt Gemmell posted an article entitled "SEO for Non-dicks" where he described the positive world of SEO.  But he also highlighted the unethical practices presented at an SEO conference.  Several companies offering SEO services feature practices like buying links, setting up link farms, and embedding hidden links in pages.  Other practices include hidden text (the same color as the background, or placed underneath other parts of the site) that may have little or nothing to do with the site.

Because these techniques are designed to mislead search engines (and consequently the consumers using the search engines), these seem to me to amount to a "bait and switch" advertisement.  This is gaming the system.  Fortunately, search engines are making great strides in penalizing the sites that use these misleading practices.

To paraphrase Field of Dreams, "If you build good content, they will come."


Tablets, Free Software, and You

Tablets are the current 'big thing' in computing devices -- so much so, in fact, that many believe tablets will replace most of the uses of laptops and desktops.  This aligns closely with the trend to put "everything" on the web.  While making everything browser-based certainly has its conveniences, it also has risks.

Users are continually placing their privacy and their data in the hands of others, while ignoring the risks posed by these actions.  Look, for example, at the terms of service and software licenses associated with the iPad.  Apple can remotely "kill" software on your iPad.  If that software was storing your data, too bad, it's gone.

What if all your images are stored in a "cloud storage" solution and your provider suddenly decides to increase rates (or decrease your free storage quota)?  Will you pay whatever it takes to get your images back?  How about your email, the videos of your children, or your personal documents?

I'm sure you believe that this won't happen, or that you can just move your data.  If you believe this, take a look at where your data is stored today.  Do you use Microsoft Outlook archives?  I hope you'll never want to load the archive files when you don't have access to Outlook.

While Richard Stallman has pointed out that even Android, based on the open source Linux kernel, probably does not qualify as free software, that matters far less than whether or not your data is free.  Even if you choose to use proprietary software, keeping your data free and open lets you move it when you need to.

Tablets and cloud services are two sides of the same coin -- while they might be convenient in the short term, their true costs are well hidden.  For the ease of use, you are giving up substantial amounts of control.  Maybe this is something you're okay with, but you shouldn't be.  I'm not.  Users of the iPad and iPhone routinely "jailbreak" their devices to wrest some control of their device back.  Why buy a device that requires circumventing the license agreement to use it the way you want?  Demand open devices.

Use open, standardized formats that are not encumbered by patents.  Make sure you have access to your data -- it's best if you keep your data somewhere you control (your own computer, flash drive, or other device).  Don't let companies who care only about their bottom line tell you what you can do with your data.

Take control of your devices, take control of your software, and most importantly, take control of your data.


Migrating an Access Database to MySQL

I'm currently taking a Database class as part of my requirements for my M.S. in Computer Science. Several of our assignments are based on a database provided to us as a Microsoft Access Database. While I have a Windows 7 Virtual Machine, and could install Office in it, I prefer to use free software whenever possible, so I looked for a way to use this database with free software.

Fortunately, the database is in the earlier .mdb format, and not the newer .accdb format. I first found a glimmer of hope in an article by Niall Donegan describing the use of the MDB Tools package.

While the steps posted by Niall worked, and worked well, there are a couple of quirks in MDB Tools that took some working around. Additionally, the steps are kind of repetitive. So I decided to write a small wrapper script for mdb-tools to export the data as a MySQL script. The script takes one argument (the name of the mdb file you're working with) and outputs the SQL script on standard output. So, for example, you might use it as: mdb2mysql students.mdb|mysql students. Here's the script (I call it mdb2mysql) itself:

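#!/bin/bash

if [ $# -lt 1 ] ; then
        echo "Usage: $0 <mdbfile>" >&2
        exit 1
fi

MDB="$1"

# Extract the schema/DDL; IF EXISTS keeps the DROP from failing on an empty database
mdb-schema "$MDB" mysql | sed 's/DROP TABLE/DROP TABLE IF EXISTS/'

# Extract table data as INSERT statements, ending each line with a semicolon.
# Note: newer mdb-tools releases may expect a backend name after -I (e.g. -I mysql).
mdb-tables -1 "$MDB" | while read -r TABLE ; do
        mdb-export -I "$MDB" "$TABLE" | sed 's/$/;/'
done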

Hopefully this helps others who just need to extract their data from an Access Database. Note that this exports only the schema and data; foreign keys, views, etc. are not included.
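If you want to see what the two sed filters in the script do without having an .mdb file handy, here is a minimal sketch. The sample SQL lines are made up for illustration and only imitate the kind of output the mdb-tools commands produce; the function names fix_schema and terminate_statements are mine, not part of mdb-tools.

```shell
#!/bin/bash
# Illustrative input only: these lines imitate the kind of SQL the
# mdb-tools commands emit; they were not captured from a real database.

# Filter 1: let the DROP succeed even when the table does not exist yet.
fix_schema() {
    sed 's/DROP TABLE/DROP TABLE IF EXISTS/'
}

# Filter 2: append a semicolon to the end of every line.
terminate_statements() {
    sed 's/$/;/'
}

echo 'DROP TABLE students;' | fix_schema
echo 'INSERT INTO students (id, name) VALUES (1,"Ada")' | terminate_statements
```

Note that the second filter works line by line, so it assumes each exported statement fits on a single line.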


Using an SSH Connection to Provide Remote Support (Part I)

Last week, at the ALE meeting, a question came up about using SSH to provide remote support for someone who is not especially Linux-literate.  I suggested using an SSH reverse tunnel so the end-user wouldn't need to worry about firewalls, NAT, etc.

Thinking about the problem, I realized that it's a little more complicated than that.  So in Part I, I'm going to discuss the general solution and the approach to the problem.  In Part II, I'll present a more comprehensive solution that will (I think) scale better.

Let's first talk about reverse SSH tunnels.  These tunnels allow a data stream to be carried across the SSH connection in reverse -- that is, from the server to the client.  This is useful for getting back in past a firewall/NAT router/etc. without needing to make configuration changes.
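To make the shape of a reverse tunnel concrete, here is a small sketch that only prints the ssh command instead of running it. The helper name reverse_tunnel_cmd and the host user@server.example.org are hypothetical placeholders; the -N and -R options are standard OpenSSH client options.

```shell
#!/bin/bash
# Print (rather than run) the command that opens a reverse tunnel.
#   $1 = port to open on the remote server
#   $2 = local port to expose through it
#   $3 = user@host of the remote server (placeholder below)
reverse_tunnel_cmd() {
    # -N: do not run a remote command, just hold the connection open
    # -R: have the server listen on port $1 and forward it to localhost:$2 here
    printf 'ssh -N -R %s:localhost:%s %s\n' "$1" "$2" "$3"
}

reverse_tunnel_cmd 2222 22 user@server.example.org
```

Because the connection is initiated from inside the firewall, only the outbound SSH connection has to be allowed; anything connecting to port 2222 on the server is carried back over that same connection.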

The Basic Premise

First off, let's be clear on the terminology we'll be using.  The "client machine" is the machine being used by the person receiving support.  The "server" is a machine under the control of the person providing the support.

Server Setup

  • Install OpenSSH Client & Server
  • Provide inbound access to SSH (port 22 or alternate port) (this may require firewall changes, router configuration, etc.)
  • Generate a keypair (we'll call this 'reverse.key') to be used to connect back to the client.
  • Create a 'support' account for inbound connections.
  • Set up a DNS entry (dynamic DNS is fine) for the server. We'll call it supporthost.example.com.

Client Setup

  • Install OpenSSH Client & Server
  • Add 'reverse.key.pub' from above to an account 'support' that has sudo access (you'll probably need sudo to provide support.)
  • Generate a keypair (we'll call the private key 'support.key') and copy off the public portion.  This should be added to the server's 'support' account.
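The key generation steps in the two lists above might look something like the following. This is a sketch under a few assumptions of mine: the Ed25519 key type, an empty passphrase (so the desktop script can run without prompting), and a scratch directory so nothing in ~/.ssh is touched. In practice each key would be generated on its own machine.

```shell
#!/bin/bash
# Generate both keypairs in a scratch directory. Key type and the
# empty passphrase are assumptions, not requirements.
keydir="$(mktemp -d)"

make_keypair() {
    # -q quiet, -t key type, -N passphrase (empty), -C comment, -f output file
    ssh-keygen -q -t ed25519 -N '' -C "$1" -f "$keydir/$1"
}

if command -v ssh-keygen >/dev/null 2>&1; then
    make_keypair reverse.key   # would be run on the server
    make_keypair support.key   # would be run on the client
    ls "$keydir"
else
    echo "ssh-keygen not found; install the OpenSSH client first" >&2
fi
```

Each call writes a private key plus a matching .pub file; the .pub file is what gets appended to the other side's authorized_keys.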

The Script

Place this script, marked executable, on the user's desktop. Double clicking it will allow a support connection in.

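#!/bin/bash

# Open a reverse tunnel: port 2222 on the support server forwards to this machine's SSH port (22).
# The key path is absolute so the script works no matter which directory it is launched from.
ssh -N -R 2222:localhost:22 -i "$HOME/.ssh/support.key" support@supporthost.example.com &

echo "Support connection ready!"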

Finally

After the script is run, you can run ssh -p 2222 support@localhost on the server to connect to their machine via the reverse SSH tunnel.

In the next part, we'll talk about a script to generate most of this for us and make it much easier to set up.


Boost, RSS Feeds, and Google Reader

For a while now, I've struggled with an issue on this site.  Google Reader would sometimes show items that had already been displayed in the reader.  They would be shown as new unread items, regardless of whether the "original" copy of that item had been read.  I'm sure this irritated many readers, and I tried several times to fix the issue.

  • The feed was successfully validated by the W3C Validator.  Multiple times.
  • Adding the feed freshly worked fine.
  • Adding the feed to other RSS readers showed only one entry per item.

I set up a cron job to pull a copy of my RSS feed regularly and save copies.  I figured I could see if anything changed between versions.  At first, the differing versions showed no significant changes other than new posts where expected.
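For anyone wanting to do the same, an archive job along these lines could be sketched as follows. This is not the script I actually ran; the function name save_snapshot, the file naming, the URL, and the archive path are all placeholders.

```shell
#!/bin/bash
# Save whatever arrives on stdin as a timestamped snapshot in directory $1,
# then print a diff against the previous snapshot, if there is one.
save_snapshot() {
    local dir="$1" prev new
    mkdir -p "$dir"
    # Timestamped names sort chronologically, so the last one is the newest.
    prev="$(ls "$dir" | sort | tail -n 1)"
    new="$dir/feed-$(date +%Y%m%d-%H%M%S).xml"
    cat > "$new"
    if [ -n "$prev" ]; then
        # diff exits non-zero when the files differ; that is not an error here.
        diff "$dir/$prev" "$new" || true
    fi
}

# In the script that cron runs (URL and path are placeholders):
#   curl -fsS https://example.org/rss.xml | save_snapshot "$HOME/feed-archive"
```

The names only have one-second resolution, so this assumes the job runs no more than once per second, which any sensible cron schedule satisfies.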

At one point, I got a clue from a fellow ALE-NW organizer that the feed was showing duplicate items.  Looking at the view that generated the feed, I realized each tag was causing a duplicate entry.  I deleted the offending relationship, and the number of entries got better.  I figured I had the Google Reader issue fixed.

This weekend, it reared its head again -- duplicate entries!  I looked back at my cron-based RSS archive and discovered that there were differences in some of the files!  As I looked at the differences, I felt like an idiot.

The first file contained: <guid isPermaLink="false">149 at https://systemoverlord.com</guid>.

Another file contained: <guid isPermaLink="false">149 at http://tuxteam.com</guid>.

I realized, as I read the differences, that Drupal bases its "base URL" on the URL that is used to access the site. (I used to use tuxteam.com.) This isn't normally a problem, because the RSS reader would be accessing it via the same domain every time, but once you're running Boost, you can get different domains from the cached copies of the files! So, if the cached RSS feed expires and Boost builds a new one on an access from tuxteam.com, the subsequent systemoverlord.com access by Google Reader returns a feed with tuxteam.com-based URLs. These guids are different, and so Google Reader believes they're different articles!
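Spotting this by eye in a diff is easy to miss, so here is a rough sketch of a check that lists every distinct base URL found in the guids of a set of archived feeds. It assumes the guids follow the "NNN at URL" pattern quoted above; the function name guid_base_urls and the two stand-in files are purely illustrative.

```shell
#!/bin/bash
# Print each distinct base URL that appears in the <guid> elements of the
# given feed snapshots. More than one line of output means the snapshots
# were generated under different base URLs.
guid_base_urls() {
    # grep pulls out one guid element per match, the first sed expression
    # strips the tags, and the second keeps only what follows " at ".
    cat "$@" |
        grep -o '<guid[^>]*>[^<]*</guid>' |
        sed -e 's/<[^>]*>//g' -e 's/^.* at //' |
        sort -u
}

# Two tiny stand-in snapshots using the guids quoted above.
tmp="$(mktemp -d)"
echo '<guid isPermaLink="false">149 at https://systemoverlord.com</guid>' > "$tmp/a.xml"
echo '<guid isPermaLink="false">149 at http://tuxteam.com</guid>' > "$tmp/b.xml"
guid_base_urls "$tmp/a.xml" "$tmp/b.xml"
```

Run against the two stand-in files, this prints both URLs, which is exactly the symptom: the same item number published under two different hosts.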

I've now set $base_url = 'http://systemoverlord.com'; in my settings.php. I believe this should finally, permanently, put the duplicate item bug to rest.