Interview with Semalt.com

Yesterday I posted about Semalt.com’s crawler and their unusual choice not to have their crawlers identify themselves as web crawlers or obey robots.txt, causing heartaches for analytics-loving webmasters across the web. Semalt’s manager Alex Andrianov reached out through Twitter and offered to answer some of my questions via email. The exchange is included in full below.

Hi, thanks for taking the time to chat with me in a bit more detail about Semalt. Happy to update my blog post with factual corrections you’re able to provide.

You’ve mentioned on Twitter that Semalt does not obey robots.txt, further saying that you “can’t change it”. Could you explain in a bit more detail what keeps Semalt’s bots from identifying themselves as bots or obeying robots.txt? Is this a talent issue, where your developers haven’t been able to discover the processes to undertake this, or is this part of a business decision on Semalt’s part?

Are there plans in the future to have Semalt’s bots identify themselves properly as crawlers and to obey robots.txt?

You also claimed that my comments at http://www.closetoclever.com/semalt-com/ were incorrect, as I was not a Semalt client. Were there any specific factual errors that you would like to address?

Thanks again for taking the time to answer these,

Jessica Rose

 

Hello Jessica Rose

Thanks for your email.

First of all I would like to bring apology on behalf of my company if our bots caused you some difficulties. I can assure you, all the visits on your website were accidental. At this moment our specialists are taking drastic actions to prevent these visits. Thank you for pointing to our service drawbacks. We appreciate your help and it is very important to us.

Our service has been launched quite recently and unfortunately there are still some bugs and shortcomings. Please, respect this fact. We are working hard trying to fix the existing errors and I hope soon our users won’t have any claims.

As you might notice, every user can manually remove their URLs from the Semalt database using Semalt Crawler. Furthermore, our Support center specialists are ready to come to the aid and remove URLs from the base once the website owner submits a request. We consider every single request and guarantee that every user will get a proper respond.

We realize this may bring some inconveniences, but unfortunately at the moment we can’t offer another way of solving this issue.

As for the comment posted on your blog, I believe it’s impossible to evaluate all the pros and cons unless you have the complete picture of the service. Probably once you try to use Semalt features you will change your mind.

Anyway, we thank you for your feedback, since we appreciate every opinion relating Semalt.

Sincerely yours,

Semalt LLC manager, Alex Andrianov

 

Thanks for the response, but would it be possible to have you address my specific questions more directly?

1. Are you claiming that your bots’ failure to identify themselves as web crawlers is due to a technical failure?
2. Are you claiming that your bots not obeying robots.txt is due to a technical failure?
3. Do you have plans to make your bots identify themselves as web crawlers?
4. Do you have plans to have your bots comply with robots.txt?

Jessie

 

Dear Jessica,

I will try to give the most definite answers to your questions. As I mentioned before our service has recently appeared on the web which causes some technical unavailability. Today we upgrade the web scanning process and adjust our robots. Unfortunately sometimes Semalt bots visit random websites, but we do all our best to solve this problem in the shortest possible time.

Thank you for your email and interest to Semalt.com service. Your opinion is very important to us.

Sincerely yours,

Semalt LLC manager, Alex Andrianov

I’m not sure that’s answering much. I’m really looking to find out:

1. Will your crawlers be respecting robots.txt after your upgrade?
2. Will your crawlers be identifying themselves as web crawlers after your upgrade?

Jessie

He hasn’t yet replied to this email, but responded to tweets on the subject:
[Images: semalt3, semalt4]

What we learned from this exchange:

Nothing, really. There were some vague claims that the problems I’ve listed were “bugs” but no specific addressing of the problems of Semalt bots ignoring robots.txt or failing to properly identify themselves as web crawlers. Apparently several weeks of visits to sites across the web were “accidental”.

Why this is nonsense:

Given how easy it is to create robots.txt-compliant crawlers, a bot’s failure to identify itself as a web crawler or obey robots.txt can only be viewed as a deliberate choice by the designer or as gross incompetence. While my technical skills are also substandard, I’m confident that I would be able to put together a simple web crawler that obeys robots.txt over the weekend (check in on Tuesday, I’ll be posting the results of my efforts). For a professional enterprise that sources data by crawling the web to claim that following industry conventions is beyond their technical ability leaves me wondering if they’re fools or liars.

What is Semalt.com?

If you’re keeping track of your website’s traffic through Google Analytics, you’ve probably noticed referral visits from a website called semalt.com in recent weeks. Semalt is a web crawler designed to gather data for Semalt’s marketing platform. The visits showing up in your logs are automated programs interacting with your site.

The difference between Semalt.com and reputable crawlers

If you look through your Google Analytics referral data, you’ll notice that the other large web crawlers such as Googlebot, MJ12bot, Rogerbot and Bingbot don’t show up in your logs. Semalt’s crawlers showing up in your traffic logs is unusual because most bots identify themselves as web crawlers and will thus be excluded from your traffic data. Because Semalt’s bots don’t, their visits are counted as if they came from people, which skews traffic data, especially for smaller sites for whom semalt.com visits make up a larger percentage of their traffic.
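As a rough illustration of what “identifying yourself” means in practice: a crawler normally announces itself through the User-Agent header it sends with each request. The sketch below is not Semalt’s code or anyone else’s; the bot name and info URL are made up for the example. It only prepares a request and prints the header, without contacting any site.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

class PoliteRequest {
    // Hypothetical bot name and info page, invented for this example.
    static final String USER_AGENT = "ExampleBot/1.0 (+http://example.com/bot-info)";

    // Prepares (but does not send) a request that announces itself as a bot.
    static HttpURLConnection prepare(String address) throws IOException {
        HttpURLConnection connection =
                (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", USER_AGENT);
        return connection;
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection connection = prepare("http://example.com/");
        // Show the header that would travel with the request.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

A site owner or analytics tool that sees a header like this can tell the visit came from a bot and treat it accordingly.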

Semalt also doesn’t respect robots.txt (an easy way for webmasters to keep bots from their sites), instead asking that concerned webmasters seek them out and add themselves to a no-crawl list that Semalt maintains. I reached out to Semalt’s Alex Andrianov on Twitter to ask if their crawlers were ignoring robots.txt. He confirmed that Semalt.com’s crawler doesn’t respect robots.txt and claimed that they were unable to have it do so.

[Image: Twitter exchange with Alex Andrianov of Semalt.com]

How to stop semalt.com from visiting your site

As Alex suggests, you can submit your site to Semalt to ask for removal from their crawl at their site, though there’s no way to tell if they’ll act on this request. As I’m inclined to distrust crawlers that don’t respect robots.txt, I’ve opted to block their access to my site through .htaccess as outlined by logorrhoea.net.

Update 15/4/14: Semalt manager Alex Andrianov suggested that parts of this post may be factually incorrect as I failed to note I am not a Semalt customer. I would like to state that I am not a Semalt client, but I stand by the information listed here as true and welcome any factual corrections.

[Image: semalt2]

Birmingham Open Code

From April the 8th there will be a weekly event for collaborative programming study sessions in Birmingham. We’ll be meeting in the Woodman Pub from 6 pm.

Birmingham Open Code is designed to provide a peer supported, mixed level learning environment. Programmers and aspiring programmers working in any language are welcome. The weekly schedule is designed to create a casual environment where learners can drop in for social learning as needed, without feeling the need to make every event. We’re looking to keep these study sessions as inclusive as possible. You’re welcome no matter your skill level, level of education, age, gender, race, sexual identity, or sexual orientation.

If you’re an established programmer, bring your laptop and be ready to help out newbies while socializing with your peers. If you’ve never programmed before and want to start, bring some great questions to get you started in the right direction.

There are also a number of hands on workshops in a range of technologies and experience levels in the pipeline. These may be added as monthly events to supplement the Open Code study sessions. Currently workshops in introductory and advanced Python, technical writing and Ruby have been proposed. To lead your own workshop, get in touch at jessica(at)closetoclever.com.

The space is handicap accessible and close to both Birmingham’s Moor Street and New Street stations.

SEO Basics: Google Penalties

We’ve already looked at why links are important in SEO. Having relevant, high quality sites linking to your content can offer search engines a vote of confidence about your content. In an effort to keep webmasters from creating spammy or valueless links to artificially inflate the value of their content, Google has released algorithms to detect unnatural linking. The Penguin updates have been designed to find unnatural linking patterns and to automatically adjust the rank of sites which have been found to have violated Google’s guidelines. The negative adjustments are called penalties.

Types of Google penalties

Google penalties can be assigned automatically through Google’s algorithms or can be assigned manually. With a manual penalty, you’ll be alerted within Google Webmaster Tools, with details about the penalty and examples of the guideline violations found. Registering your site with Webmaster Tools is a great idea regardless of your risk for penalties, as it will offer you valuable information about your site.

Algorithmic penalties are assigned automatically and aren’t accompanied by a notice of penalty. The best way to determine if your site might have been hit by an algorithmic penalty is to look for dramatic changes in search rankings and traffic that happen independently of changes to the site.

How to avoid penalties

If you’re the only one working on your site, follow the Guidelines by Google in creating content and links to your site. If you’re working on your site alongside others, be aware of the work they’re doing and make sure everyone involved is aware of the guidelines and risks associated with not adhering to them. If you’re outsourcing your SEO to a third party, be aware of what work they’re doing on your behalf. Monitoring your new backlinks in Google Webmaster Tools or a third party service like Moz or Majestic SEO can help you track links being built for your site before they cause any problems. Majestic will let you check all the backlinks of your own site for free, making it easy to keep tabs on your risk levels at no extra cost.

Have questions about Google penalties? Ask in the comments below or find me on Twitter.

Learn to program through work

I’ve been lucky enough to leverage a non-technical role in a tech company to help further my study of programming. I wanted to offer actionable tips for how others working around tech can use the resources at their work to further their own studies.

Find an information rich environment

If you’re already working in an environment that uses the technologies you’re looking to learn, you’re going to be in a great place to access resources and support, even if your role doesn’t (yet) involve working with them. If you’re not currently in a workplace that interacts with the technologies you’re trying to learn, you may be better served by looking for support in community or educational groups and meetups, or by looking for individuals working with the tech you’re interested in who might be able to help. Or you might want to consider starting with the technologies available at work. Companies with training or internship programs may have more systems in place to offer on the job support for learning.

Look for existing programs and resources

Many larger companies may have programs in place to support professional development. These may include tuition assistance for courses, help paying for workshops or training, or formal mentorship pairings. If your company is large enough to have an HR department, this should be your first stop. Be prepared to present a compelling business reason that they should offer you support.

Smaller companies may also have policies on supporting professional or personal development. Check your contract and handbook as well as asking around to see if you can find existing programs that fit your needs. If you can’t find anything currently in place and have a compelling business reason for support, ask a member of HR or management if they’re interested in talking about ways they could help.

Companies without formal policies to help support your learning may still have resources to help you as you learn to program. It’s not uncommon to find libraries of technical books available on development teams. Check to see if there is software, hardware or learning materials that you might be able to access.
[Image: technical books available at my office]

Look for sources of informal support or mentorship

One of the best things to source through your workspace is the support and advice of people with professional experience in your subject. If you have colleagues who are willing to help out, start looking to them for help. These coworkers don’t have to be experienced programmers to offer great support. Placement students, interns, new hires and anyone else who might also be working to better develop their own skills might be open to studying with you, as well as offering advice.

When you first start looking to coworkers for support, try to keep things casual. Asking for help with specific questions or concepts is an easy way for your coworkers to help out without the expectation of an ongoing commitment. Wait to ask for a formal mentor-mentee relationship till you’ve established a rapport with casual queries.

Have something to show your progress

Before you approach colleagues for support, make sure that you have something more than your aspirations in hand. Do the foundation work in learning to program on your own and have proof of your progress. People are going to be more interested in helping active, self-sufficient learners who can demonstrate that they’re serious about learning. I recommend starting with an online tutorial system like Codecademy or online courses. Getting started on your own also gives your coworkers more manageable questions to answer. Answering “Why would I need to use a static class in this code?” is a faster and easier task than answering “How do I program with JavaScript?”.

Have clear goals

Before you try to involve colleagues in your learning goals, make sure you have a clear understanding of your short term and long term goals. Are you studying in order to better function in your current role, or do you dream of moving into the dev team? What projects do you want to take on to meet this goal? What do you want to get done on those projects this week?

Ask good questions

Your colleagues may be happy to help, but make sure you’re only coming to them with the questions you need answered by a real, live person. Check Stack Overflow and Google any specific problems you encounter. To get information on bigger programming concepts, work with technical books and YouTube tutorials. If none of these can answer your question, they’ll still leave you better able to ask great, well informed questions when you do have to run something past your coworkers.

Demonstrate the value of your expanding skill set

If you’re asking for support from your workplace, make sure that you’re able to show that they’re getting something back from the process. Using your new skill set in the workplace will help give you valuable real world experience in your subject area. It’ll also show those supporting you that supporting your continued studies will pay off through your ability to contribute more to the team.

Start with small and specific tasks

If you’re going to be trying to expand the scope of your current role to let you work with tasks which reinforce your developing skill set, be sure to start with small, manageable tasks before moving on to larger projects to test your programming chops. Tasks like re-writing documentation, answering technical support requests and simple bug fixes are a great way to get started handling increasingly challenging technical tasks.

Respect IP and company policies

If your work is giving you the chance to learn on the job, make sure you’re not furthering your studies at their expense. Don’t create projects that compete with their products or use their resources. If you’re not sure if your project is going to fall foul of company policy, ask someone!

Don’t source all your learning through work

It’s great to have a workplace that supports you as you learn to program. Just be sure that you’re not relying on them for all of your support and resources. Constant programming questions can be distracting in the workplace. Programming is also wonderfully idiosyncratic, with teams tending to develop styles unique to the team. If you want to develop flexible programming skills (and make sure you’re not taking up too much company time) try to find contacts outside of the workplace to help out, as well.

If you have any questions on or suggestions for using your workplace to learn to program, ask in the comments below or find me on Twitter.

Strings in Java

We already looked at primitive data types in Java. These primitive types are great if you need to create variables that are whole numbers, numbers with a decimal point, true/false states or single characters. But you’ll often need more than a single character in your code. Here you would use a string, or a single object made of a character or series of characters. We’ll look at how to create strings in Java and how they differ from primitive data types.

Differences between strings and primitive types

When you set a variable to a primitive data type in Java, you’re telling your program to set aside enough memory for this variable. It’s like getting out a box that’s the right size for your variable. If you set a variable to an int, for example, you’re getting out a box just the right size and shape for all but the longest whole numbers. You can change the value of this variable later in your program. Think of it as tipping your box over to empty it and refilling it with something else that fits.
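To make the box idea concrete, here is a minimal sketch. The class and variable names are just placeholders chosen for this example.

```java
class BoxExample {
    public static void main(String[] args) {
        int legs = 4;               // get out an int-sized box and put 4 in it
        System.out.println(legs);   // prints 4

        legs = 8;                   // tip the box out and refill it with 8
        System.out.println(legs);   // prints 8
    }
}
```

The box stays the same size the whole time; only what’s inside it changes.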

Strings in Java are different, right from the start. You may have noticed that all of the primitive types are written in all lower case letters. String needs a capital S. You’ll also need to wrap the text of your string in quotation marks. It’s also structurally different. While a primitive variable’s value can be changed again and again by reassigning it, a string object can’t be changed once it is set. If we think of a primitive type variable as a box suitable for holding values of the right type, a String variable is more like a signpost, pointing your program to where the string’s value lives. A primitive type variable carries its value around and can easily be changed; a String variable just points to a string that never changes, and assigning it a new value points the signpost at a different string rather than altering the original.
[Image: difference between primitive types and strings in Java]
Strings are objects made up of one or more characters which can include letters, numbers, spaces or other Unicode characters. Strings can be made up of the same values as primitive values, though they’ll behave differently. The integer 32 and the string “32” look similar, but they’ll have very different uses. If you’ll need the number to change, through math or other calculations, you’ll want to use the integer. If you want to create an object that just points you back to the fixed value 32, you’ll want the string.

String example = "This is an example of a string.";
String blanks = "    ";
String birthDate = "01/11/1990";
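Here is a small sketch of how the integer 32 and the string made of the characters 3 and 2 behave differently. The class and variable names are made up for this example. The last few lines also show that building a longer string creates a new string and leaves the old one untouched.

```java
class NumberOrText {
    public static void main(String[] args) {
        int number = 32;
        String text = "32";

        System.out.println(number + 1);  // arithmetic: prints 33
        System.out.println(text + 1);    // joins text together: prints 321

        // Integer.parseInt reads the digits in a string as an int.
        System.out.println(Integer.parseInt(text) + 1);  // prints 33

        // 'copy' points at the same string as 'text'.
        String copy = text;
        // This builds a new string and points 'text' at it.
        text = text + "!";
        System.out.println(copy);  // prints 32
        System.out.println(text);  // prints 32!
    }
}
```

With a string, the + sign means “stick these together” rather than “add these up”, which is why the second line prints 321.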

Printing a string to the console

You can print a string to the console by putting your string inside the parentheses of System.out.println(). If you’re using Eclipse, the shortcut for System.out.println() is typing sysout and then control + space bar. Below is code which creates a string called exampleString, sets the value of exampleString to “This is my string.” and then prints it to the console.

public class ExampleClass {
	public static void main(String[] args) {
		String exampleString = "This is my string.";
		System.out.println(exampleString);
	}
}

Questions about strings in Java? Ask in the comments below or find me on Twitter.

What is a Java class?

If you’re going to start learning Java, you’ll need to get comfortable working with classes. I’ve seen classes described a number of ways, the best of which is from Oracle’s Java Tutorials: “A class is the blueprint from which individual objects are created.”

So a Java class is a blueprint, or outline of an object or series of objects you intend on using in your Java project. Classes can include the name of the class, what attributes the class may include and what methods the class will use.

Let’s say we want to make an animal class. First we’ll create the class Animal.

 class Animal {}

Now we’ll add the attributes we’ll expect to see in the curly braces. In Java, you’ll have to specify what type your variable is before giving its name. We’ll want our animals to have a set number of eyes, legs and a sound it makes. We’ve set the eyes and legs to int, or integers, because they’ll be whole numbers. Our sound is going to be a word or phrase we supply, so we’ll set it to a String.

 class Animal {
    int eyes;
    int legs;
    String sound;
}

We’ll also want to make our animals speak, using the sound attribute. So we’ll add the method speak to our Animal class. This method should print the animal’s sound to the console.

class Animal {
    int eyes;
    int legs;
    String sound;

    void speak() {
        System.out.println(sound);
    }
}

Now that we’ve built our Animal class, we need to construct specific animals using the class. We’ll do this by using the aptly named constructor. We’ll make one of these:

[Image: cute stuffed octopus]

It’s an octopus, has 2 eyes, has 8 legs and we’ll say that it makes a bubble sound when it speaks.

	Animal fluffyOctopus = new Animal();
	fluffyOctopus.eyes = 2;
	fluffyOctopus.legs = 8;
	fluffyOctopus.sound = "bubble";

We’ve got our finished octopus! But…it doesn’t do anything yet, which isn’t very satisfying. Let’s add the speak method in, to have the octopus print the sound it makes to the console.

	Animal fluffyOctopus = new Animal();
	fluffyOctopus.eyes = 2;
	fluffyOctopus.legs = 8;
	fluffyOctopus.sound = "bubble";
	fluffyOctopus.speak();

Now we need to put these together to be able to create our fluffy octopus and have it speak. All of the code we’ve written so far will need to be inside our public class. I’ve named mine App. Our Animal class is a static class, which we’ll learn more about later. The code that builds the octopus goes inside the main method, which belongs to App rather than to Animal, so watch where Animal’s closing curly brace falls. I’ve put the Animal class first to match the order we wrote it in, though Java doesn’t require a class to appear in the file before the code that uses it.

public class App {
	// Animal is nested inside App; static lets main create one directly.
	static class Animal {
		int eyes;
		int legs;
		String sound;

		void speak() {
			System.out.println(sound);
		}
	}

	// main belongs to App, so running App starts here.
	public static void main(String[] args) {

		Animal fluffyOctopus = new Animal();
		fluffyOctopus.eyes = 2;
		fluffyOctopus.legs = 8;
		fluffyOctopus.sound = "bubble";
		fluffyOctopus.speak();

	}
}

If you’ve been following along to create this code in your IDE (or if you copy and paste it) you should have working code that produces the word “bubble” when run. Try changing the code to get different results. Can you have our fluffy octopus tell you how many legs it has? Or create your own animal, under the octopus.
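If you get stuck on those two challenges, here is one possible way to approach them. It’s only a sketch: the describeLegs method, the duck and the AnimalPractice class name are my own additions for illustration, and there are plenty of other ways to do it.

```java
class AnimalPractice {
    static class Animal {
        int eyes;
        int legs;
        String sound;

        void speak() {
            System.out.println(sound);
        }

        // Hypothetical extra method: builds a sentence about the legs.
        String describeLegs() {
            return "I have " + legs + " legs";
        }
    }

    public static void main(String[] args) {
        Animal fluffyOctopus = new Animal();
        fluffyOctopus.eyes = 2;
        fluffyOctopus.legs = 8;
        fluffyOctopus.sound = "bubble";
        System.out.println(fluffyOctopus.describeLegs());  // I have 8 legs

        // A second animal built from the same blueprint.
        Animal duck = new Animal();
        duck.eyes = 2;
        duck.legs = 2;
        duck.sound = "quack";
        duck.speak();  // quack
    }
}
```

Both animals come from the same Animal class, but each one keeps its own values for eyes, legs and sound.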

Questions about classes in Java? Ask in the comments below or find me on Twitter.

SEO Basics: Bots and SEO

Before a search engine can list content in search results it first needs to find content and gather information about it. Search engines discover content and collect data through web crawlers, a specialized type of bot that crawls the web.

What is a bot?

A bot is a simple, automated piece of software with a specific purpose, usually used for repetitive tasks. Web crawlers are a type of bot designed to navigate through the web and collect information, following links as they travel. Search engines use the data collected by web crawlers to populate their indexes. When people talk about a page having been indexed by Google, they mean that the page has been crawled by Google’s web crawlers and that the data found was added to Google’s index.

Web crawlers are often described as digital spiders, made of code, that crawl along web content to gather data. Due to the size and ever expanding nature of the web, most web crawlers have a priority based crawl, visiting the most important or accessible content first.

What do bots crawl?

Web crawlers are tireless workers, but they’re not very smart. They can only crawl a limited range of content. They can easily crawl HTML code and text. They can crawl images as well, but will look to text associated with the image, such as alt text, to get an idea of what the image represents. They can’t crawl more complicated content types, like JavaScript code.

How to limit bots on your site

As a webmaster, you can decide if, where and at what speed you allow web crawlers to interact with your content by creating a file on your site called robots.txt. Reputable and well behaved bots will check the robots.txt file of your site before crawling to be sure that they’re permitted to crawl your content. Use this file to limit bots from crawling any content you don’t want indexed.
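To give a feel for what a well behaved bot does with that file, here is a deliberately simplified sketch of the check in Java. It only reads Disallow lines that sit directly under “User-agent: *” and treats each one as a path prefix. Real robots.txt handling also covers rules for named bots, Allow lines and wildcards, all of which this skips. The class and method names are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

class RobotsCheck {
    // Collects the Disallow paths listed under "User-agent: *".
    static List<String> disallowedForAll(String robotsTxt) {
        List<String> disallowed = new ArrayList<>();
        boolean inWildcardGroup = false;
        for (String rawLine : robotsTxt.split("\n")) {
            // Drop anything after a # (a comment) and surrounding spaces.
            String line = rawLine.split("#", 2)[0].trim();
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;  // blank line or not a "field: value" line
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inWildcardGroup = value.equals("*");
            } else if (field.equals("disallow") && inWildcardGroup && !value.isEmpty()) {
                disallowed.add(value);
            }
        }
        return disallowed;
    }

    // A path is allowed unless it starts with one of the disallowed prefixes.
    static boolean isAllowed(String robotsTxt, String path) {
        for (String prefix : disallowedForAll(robotsTxt)) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String robotsTxt = "User-agent: *\nDisallow: /private/\n";
        System.out.println(isAllowed(robotsTxt, "/blog/post"));      // true
        System.out.println(isAllowed(robotsTxt, "/private/notes"));  // false
    }
}
```

A polite crawler would run a check like isAllowed before requesting each page and skip anything that comes back false.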

Bots can also be limited or barred from crawling pages on the server level, by some types of security software or by your hosting company. If you find that bots you want visiting your site are being barred, check your robots.txt and talk to your hosting company to be sure that they’re not barred from crawling.

Bots and SEO

Because search engines source some of their data about the content and value of your site through their crawl, it’s in your best interest as a webmaster to make your content accessible and appealing for their bots.

Have a clear set of keywords or goals in mind when creating new content and be sure that you’re presenting these keywords or concepts in ways that web crawlers can easily access. Using URLs and titles that clearly describe the content can be a great way of letting web crawlers know what kind of keywords or concepts they may want to associate with the content.

Luckily, most of the simple changes that make your content more bot-friendly also make your content more user-friendly! Writing meta descriptions for your content can help both bots and potential visitors understand what to expect from specific content. Adding high quality alt text to images both helps web crawlers understand the value of your images and makes your site more accessible for users with screen readers.

Have questions about bots and SEO? Ask in the comments below or find me on Twitter.

Primitive Types in Java

In Java, you’ll need to declare a variable’s type before setting the variable. Some of the simplest and most common variable types are primitive data types. These include numbers of various types, single characters and true/false designations.

First we’ll take a look at the primitive types that can be assigned to whole numbers. The big difference between the primitives for whole numbers is the amount of memory that is set aside for each. I find myself using int most often, but byte and short can be useful in arrays where memory might be an issue.

byte (8 bit): This can hold negative or positive whole numbers ranging from -128 to 127
short (16 bit): This can hold negative or positive whole numbers ranging from -32,768 to 32,767
int (32 bit): This can hold negative or positive whole numbers ranging from -2,147,483,648 to 2,147,483,647
long (64 bit): This can hold negative or positive whole numbers ranging from -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807
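You don’t have to memorise these ranges, as Java keeps them in constants such as Integer.MAX_VALUE. The sketch below (the class name is made up for this example) prints each range and shows what happens when an int is pushed past its limit.

```java
class WholeNumberRanges {
    public static void main(String[] args) {
        // Each wrapper class knows the limits of its primitive type.
        System.out.println("byte:  " + Byte.MIN_VALUE + " to " + Byte.MAX_VALUE);
        System.out.println("short: " + Short.MIN_VALUE + " to " + Short.MAX_VALUE);
        System.out.println("int:   " + Integer.MIN_VALUE + " to " + Integer.MAX_VALUE);
        System.out.println("long:  " + Long.MIN_VALUE + " to " + Long.MAX_VALUE);

        // Going one past the largest int wraps around to the smallest int.
        int biggest = Integer.MAX_VALUE;
        System.out.println(biggest + 1);   // -2147483648

        // A long has room for the same sum; the L marks a long value.
        long roomy = Integer.MAX_VALUE + 1L;
        System.out.println(roomy);         // 2147483648
    }
}
```

The wrap-around is a good reason to reach for long when you expect numbers in the billions.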

There are also primitive types in Java that can be assigned to numbers with decimal places.

float (32 bit): This can hold a single-precision floating point number. A float field that hasn’t been given a value defaults to 0.0f.
double (64 bit): This can hold a double-precision floating point number and is the usual choice when you need more digits after the decimal point
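One way to see the difference in precision is to store the same fraction in both types and print them. This is just a sketch with made-up names; the exact digits printed matter less than how many of them each type keeps.

```java
class DecimalPrecision {
    // A float field that is never assigned starts out as 0.0f.
    static float unset;

    public static void main(String[] args) {
        float third = 1.0f / 3.0f;        // the f suffix marks a float value
        double preciseThird = 1.0 / 3.0;  // decimal values are double unless marked

        System.out.println(unset);         // prints 0.0
        System.out.println(third);         // roughly 7 significant digits
        System.out.println(preciseThird);  // roughly 16 significant digits
    }
}
```

Neither type can store one third exactly, but the double gets a good deal closer.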

There are two primitive types that don’t involve numbers.

boolean: This can hold one true/false state
char: This can hold one 16-bit Unicode character.
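A quick sketch of both in use, with names made up for the example. Note that a char is written with single quotes, unlike the double quotes used for a String.

```java
class TruthAndCharacters {
    public static void main(String[] args) {
        boolean hasTentacles = true;  // only ever true or false
        char initial = 'J';           // exactly one character, in single quotes

        System.out.println(hasTentacles);   // true
        System.out.println(initial);        // J

        // Behind the scenes a char is stored as a number.
        System.out.println((int) initial);  // 74
    }
}
```

Casting the char to an int reveals the Unicode number that stands for the letter J.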

Have questions about primitive types in Java? Ask in the comments below or find me on Twitter.