Wednesday 17 June 2015

Multiple XML Sitemaps: Increased Indexation as well as Traffic!

Utilizing XML sitemaps and sitemap indices is an advanced tactic.

What Are XML Sitemaps?

Simply put, an XML sitemap is a bit of Extensible Markup Language (XML), a standard machine-readable format consumable by search engines and other data-munching programs like feed readers.

There are standard single XML sitemaps: one file of XML code explaining to the search engines which pages are important. It is a set of instructions to the search engines, and they are more guidelines than rules. Posting an XML sitemap is kind of like rolling out the red carpet for search engines and giving them a roadmap of the preferred routes through the site.
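
For reference, a minimal single sitemap is just a list of URLs in the sitemaps.org format; something like this (the URLs below are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/</loc>
    <lastmod>2015-06-17</lastmod>
    <changefreq>weekly</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>http://example.com/about</loc>
  </url>
</urlset>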

From One to Many

The best way to break that out into many sitemaps depends on how your site is structured. Do you have a blog-based system with categories and content in each category? Do you have sets of products? Or many locations for your business?

Simple: Groups of 100 pages per sitemap (or 1000, or 10000, but try to keep it smaller)
Better: Static Pages (homepage, about, etc.), Products, Blog
Best: Static, Categories, Subcategories, Locations, Blog by Date, etc.
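
A sitemap index is what ties those groups together: one file that simply points to each individual sitemap, and that is the file you submit to the search engines. A sketch, assuming the individual sitemaps sit in the site root (the file names are placeholders):

<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>http://example.com/sitemap-static.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap-products.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://example.com/sitemap-blog.xml</loc>
  </sitemap>
</sitemapindex>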

Thursday 11 June 2015

Google Lite!

You can use Google to search for just about anything and get answers in a flash. But, when you’re searching on your phone, a slow connection can really hold things up.

To make sure you’re not waiting on Google when you need it most, we’ve rolled out a streamlined Search results page that loads fast, even on those slow connections. You’ll get all the info you need in a simpler format that’s beautiful and easy to use. Best of all: there’s nothing new to download or update — the lighter version will kick in automatically when needed!

Google is basically showing those users a toned-down version of your web page, stripping out the heavy images and files, served from a special Google URL.

Google added a help page that lets you test your web page through the optimized Google version. The URL format is http://icl.googleusercontent.com/?lite_url=[your_website_URL]; replace [your_website_URL] with your URL. For example, for yahoo.com it would be https://icl.googleusercontent.com/?lite_url=http://yahoo.com&re=1

Google said they are testing this in Indonesia for users on slow connections, such as 2G speeds. Google said the tests thus far have shown these pages loading four times faster than the originals and using 80% fewer bytes, plus a 50% increase in traffic to the optimized pages.

Monday 8 June 2015

What is .htaccess?

.htaccess files (or “distributed configuration files”) provide a way to make configuration changes on a per-directory basis. A file, containing one or more configuration directives, is placed in a particular document directory, and the directives apply to that directory, and all subdirectories thereof.

Directives

“Directives” is the terminology Apache uses for the commands in its configuration files. They are normally relatively short commands, typically key-value pairs, that modify Apache’s behavior. An .htaccess file allows developers to apply a bunch of these directives without requiring access to Apache’s core server configuration file, often named httpd.conf (typically referred to as the "global configuration file").
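
For example, a single hypothetical directive dropped into an .htaccess file is enough to change how Apache serves that directory:

# Hypothetical example: switch off automatic directory listings
# for this directory and all of its subdirectories
Options -Indexes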

Enabling .htaccess:

.htaccess files are normally enabled by default. This is actually controlled by the AllowOverride Directive in the httpd.conf file. This directive can only be placed inside of a <Directory> section. The typical httpd.conf file defines a DocumentRoot and the majority of the file will contain Directives inside a <Directory> section dealing with that directory. This includes the AllowOverride directive.

The value is often “All”, which is why .htaccess files usually end up enabled by default; the other extreme is “None”, which means they are completely disabled (and is the built-in default since Apache 2.4). There are numerous other values that limit overrides to only certain kinds of directives.
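
As a sketch, the relevant piece of a typical httpd.conf might look something like this (the DocumentRoot path is just a common example):

DocumentRoot "/var/www/html"

<Directory "/var/www/html">
    # "All" lets .htaccess files override settings here and in subdirectories;
    # "None" makes Apache ignore .htaccess files entirely
    AllowOverride All
</Directory>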

Checking if .htaccess is Enabled:

1 - is_htaccess_enabled

This test case is very simple. It uses a directive to make Apache look for the “index_good.html” file before “index.html”. If .htaccess support is enabled, when you point your browser at the folder, Apache will load the .htaccess file and know that it should display the “index_good.html” page.

# This Directive will make Apache look first
# for "index_good.html" before looking for "index.html"
DirectoryIndex index_good.html index.html

2 - is_htaccess_enabled_2
Any syntax error in your .htaccess file will cause the server to hiccup.
You can use this to your advantage to test if your server has .htaccess support enabled!

# This file is intended to make Apache blow up.  This will help
# determine if .htaccess is enabled or not!
AAAAAA

So, if you get back a page yelling “Internal Server Error”, then your server is looking for .htaccess files!

Consequences of .htaccess files:

- Always keep in mind that you’re affecting all of the subdirectories as well as the current directory.
- Also, when enabled, the server takes a potential performance hit. The reason is that, with .htaccess support enabled, every time Apache goes to fetch a requested file for a client it has to look for a .htaccess file in every single directory leading up to wherever the file is stored.

Headers

Let's start simple: we'll add a header to the response and see what happens:

# Add the following header to every response
Header add X-HeaderName "Header Value"

I ran a test to check whether certain modules were enabled on a web server. I wrote the following check:

<IfModule mod_gzip.c>
  Header add X-Enabled mod_gzip
</IfModule>
<IfModule mod_deflate.c>
  Header add X-Enabled mod_deflate
</IfModule>

There is a difference between Header set and Header add. With add, the header will always get added to the response, even if it then shows up multiple times in the response. This is most often what you would want for custom headers. You would use set when you want to override the value of one of the default headers that Apache returns.
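
A quick sketch of the difference (the X-Example header name is made up for illustration):

# "set" replaces any existing header of the same name, so only one copy is sent
Header set X-Example "only value"

# "add" appends another header line even if one with that name already exists
Header add X-Example "first value"
Header add X-Example "second value"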

Custom error documents..

The usual method; the "err" folder (with the custom pages) is in the site root:
# custom error documents
ErrorDocument 401 /err/401.php
ErrorDocument 403 /err/403.php
ErrorDocument 404 /err/404.php
ErrorDocument 500 /err/500.php

# quick custom error "document"..
ErrorDocument 404 "<html><head><title>NO!</title></head><body><h2><tt>There is nothing here.. go away quickly!</tt></h2></body></html>"

Save bandwidth with .htaccess!

This enables PHP's built-in transparent zlib compression:
<IfModule mod_php4.c>
 php_value zlib.output_compression 16386
</IfModule>

Expire Header

<IfModule mod_expires.c>
    ExpiresActive on

    ExpiresByType image/jpg "access plus 1 month"
    ExpiresByType image/jpeg "access plus 1 month"
    ExpiresByType image/gif "access plus 1 month"
    ExpiresByType image/png "access plus 1 month"
</IfModule>

Compression by file extension
<IfModule mod_deflate.c>
  <FilesMatch "\.(js|css|html|jpg|png|php)$">
    SetOutputFilter DEFLATE
  </FilesMatch>
</IfModule>

How to add the Vary Accept-Encoding header:

<IfModule mod_headers.c>
  <FilesMatch "\.(js|css|xml|gz)$">
    Header append Vary: Accept-Encoding
  </FilesMatch>
</IfModule>

Compression by file type

####################
# GZIP COMPRESSION #
####################
SetOutputFilter DEFLATE
AddOutputFilterByType DEFLATE text/html text/css text/plain text/xml application/x-javascript application/x-httpd-php
BrowserMatch ^Mozilla/4 gzip-only-text/html
BrowserMatch ^Mozilla/4\.0[678] no-gzip
BrowserMatch \bMSIE !no-gzip !gzip-only-text/html
BrowserMatch \bMSI[E] !no-gzip !gzip-only-text/html
SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip

Header append Vary User-Agent env=!dont-vary
#End Gzip

Image Expire Tag

First we enable expirations and then we set a default expiry date for files we don't specify.
<IfModule mod_expires.c>
# Enable expirations
ExpiresActive On 
# Default directive
ExpiresDefault "access plus 1 month"
</IfModule>

<IfModule mod_expires.c>
# My favicon
ExpiresByType image/x-icon "access plus 1 year"
# Images
ExpiresByType image/gif "access plus 1 month"
ExpiresByType image/png "access plus 1 month"
ExpiresByType image/jpg "access plus 1 month"
ExpiresByType image/jpeg "access plus 1 month"
# CSS
ExpiresByType text/css "access plus 1 month"
# Javascript
ExpiresByType application/javascript "access plus 1 year"
</IfModule>

Thursday 4 June 2015

Page Title Pixel Meter - Google counts a page title's pixels, not its characters

The maximum width of a page title is between 466 and 496 pixels (it is not measured in characters)!
Since Google determines all by itself what to show as the page title on a SERP, you had better make sure that your full page title is relevant to the content.
E.g. filling a page title with 121 “|” characters won’t work; Google will replace your page title on the SERP.

Pixel width makes total sense for the search engines to use on their SERPs, but it does make it harder for webmasters and SEOs to control their search snippets, which is genuinely frustrating. Hence, best practice is really important here: ensure your key phrases are at the beginning of titles in particular, to increase the likelihood that they will be visible.
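
As a small illustration (the phrase and brand are made up), a title written with the key phrase up front stays useful even if Google truncates it at the pixel limit:

<title>Quality Wrenches for Mechanics | Example Tools Co.</title>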

Preview Tool

Wednesday 3 June 2015

Rel Canonical - How To

What is the canonical tag?

First of all, we can't seem to agree on what to call it. Rest assured that 'rel canonical', 'rel=canonical', 'rel canonical tag', 'canonical url tag', 'link canonical tag' and simply 'canonical tag' all refer to the same thing.

The canonical tag is a page-level meta tag that is placed in the HTML header of a webpage. It tells the search engines which URL is the canonical version of the page being displayed. Its purpose is to keep duplicate content out of the search engine index while consolidating your page’s strength into one ‘canonical’ page.

Let's look at a list of common duplicate-content URLs.

http://example.com/quality-wrenches.htm (the main page)
http://www.example.com/quality-wrenches.htm
http://example.com/quality-wrenches.htm?ref=crazy-blog-lady
http://example.com/quality-wrenches.htm/print

How is it implemented?

The canonical tag is part of the HTML header of a webpage, the same place as the title tag. The code, for the main page in the example above, would look like this:

<link rel="canonical" href="http://example.com/quality-wrenches.htm"/>

There is usually a better solution

The canonical tag is not a replacement for a solid site architecture that doesn’t create duplicate content in the first place.
Let's go through some of the URL examples I provided above; this time we'll talk about how to fix them without the canonical tag.

Example 1: http://www.example.com/quality-wrenches.htm
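
The cleaner fix for the www/non-www duplicate is a site-wide 301 redirect so only one hostname ever serves the content. A minimal .htaccess sketch, assuming Apache with mod_rewrite and that the non-www host is the preferred one:

<IfModule mod_rewrite.c>
  RewriteEngine On
  # Permanently redirect www.example.com to example.com, keeping the path
  RewriteCond %{HTTP_HOST} ^www\.example\.com$ [NC]
  RewriteRule ^(.*)$ http://example.com/$1 [R=301,L]
</IfModule>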

Common mistakes with rel=canonical

Mistake 1: rel=canonical to the first page of a paginated series

Imagine that you have an article that spans several pages:
example.com/article?story=cupcake-news&page=1
example.com/article?story=cupcake-news&page=2
and so on
Specifying a rel=canonical from page 2 (or any later page) to page 1 is not correct use of rel=canonical, as these are not duplicate pages. Using rel=canonical in this instance would result in the content on pages 2 and beyond not being indexed at all.
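
At the time, the usual alternatives were either pointing rel=canonical at a single "view-all" page (if one exists), or leaving every page in the series self-canonical and linking them with rel="next" and rel="prev". A sketch for page 2 of the example above (page 3 is assumed to exist):

<link rel="canonical" href="http://example.com/article?story=cupcake-news&amp;page=2" />
<link rel="prev" href="http://example.com/article?story=cupcake-news&amp;page=1" />
<link rel="next" href="http://example.com/article?story=cupcake-news&amp;page=3" />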

Mistake 2: Absolute URLs mistakenly written as relative URLs

The <link> tag, like many HTML tags, accepts both relative and absolute URLs. Relative URLs include a path “relative” to the current page. For example, “images/cupcake.png” means “from the current directory go to the “images” subdirectory, then to cupcake.png.”
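
So if the href is written without the "http://" scheme, it is silently treated as a relative path. A sketch (assuming the tag sits on a page at the root of example.com):

<!-- Intended canonical, but written as a relative URL -->
<!-- Crawlers resolve it to http://example.com/example.com/cupcake.html -->
<link rel="canonical" href="example.com/cupcake.html" />

<!-- Correct: a full absolute URL -->
<link rel="canonical" href="http://example.com/cupcake.html" />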

Mistake 3: Unintended or multiple declarations of rel=canonical

Occasionally, we see rel=canonical designations that we believe are unintentional.

Mistake 4: Category or landing page specifies rel=canonical to a featured article
If you want users to be able to find both the category page and the featured article, it’s best to have only a self-referential rel=canonical on the category page, or none at all.

Mistake 5: rel=canonical in the <body>

The rel=canonical link tag should only appear in the <head> of an HTML document. Additionally, to avoid HTML parsing issues, it’s good to include the rel=canonical as early as possible in the <head>. When we encounter a rel=canonical designation in the <body>, it’s disregarded.
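
In practice that means putting the tag near the top of the <head>, before anything that could accidentally close the head early. A sketch using the wrench example from above:

<head>
  <meta charset="utf-8">
  <!-- rel=canonical early in the head, never in the body -->
  <link rel="canonical" href="http://example.com/quality-wrenches.htm" />
  <title>Quality Wrenches</title>
  <!-- other meta tags, CSS and scripts follow -->
</head>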

Why Your Site Might Not Get Indexed

Typically these kinds of issues are caused by one or more of the following reasons:


  • Robots.txt - This text file, which sits in the root of your website's folder, communicates a certain number of guidelines to search engine crawlers. For instance, if your robots.txt file contains the lines "User-agent: *" and "Disallow: /", it's basically telling every crawler on the web to take a hike and not index ANY of your site's content.
  • .htaccess - This is an invisible file which also resides in your WWW or public_html folder. You can toggle visibility in most modern text editors and FTP clients. A badly configured htaccess can do nasty stuff like infinite loops, which will never let your site load.
  • Meta Tags - Make sure that the page(s) that's not getting indexed doesn't have these meta tags in the source code: <META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
  • Sitemaps - Your sitemap isn't updating for some reason, and you keep feeding the old/broken one to Webmaster Tools. After you have addressed the issues pointed out in the Webmaster Tools dashboard, always generate a fresh sitemap and re-submit it.
  • URL Parameters - Within the Webmaster Tools there's a section where you can set URL parameters which tells Google what dynamic links you do not want to get indexed. However, this comes with a warning from Google: "Incorrectly configuring parameters can result in pages from your site being dropped from our index, so we don't recommend you use this tool unless necessary."
  • You don't have enough PageRank - The number of pages Google crawls is roughly proportional to your PageRank.
  • Connectivity or DNS issues - It might happen that, for whatever reason, Google's spiders cannot reach your server when they try to crawl. Perhaps your host is doing maintenance on their network, or you've just moved your site to a new home, in which case the DNS delegation can stuff up the crawlers' access.
  • Inherited issues - You might have registered a domain which had a life before you. I had a client who got a new domain and did everything by the book, but Google refused to index them, even though it accepted their sitemap. After some investigation, it turned out that the domain had been used several years earlier as part of a big linkspam farm. We had to file a reconsideration request with Google.