Monday 5 December 2016

Search, and you shall find

I tend towards the view that Google sets out to be, and believes itself to be, on the whole a force for good. Sergey Brin's original motto for the company was 'don't be evil'; Google now says that its mission is "to organise the world’s information and make it universally accessible and useful"; that its core aim is "to make it as easy as possible for you to find the information that you need and get the things that you need to do done."

I'm going to take that at face value; in this essay I shall write as though I believe these claims to be true (and, in fact, that is true: on the whole I do).

So when Carole Cadwalladr, working from original research by Jonathan Albright, forensically demonstrates that Google is acting as a potent amplifier for neo-fascist propaganda, we need to ask what is happening.

There are, essentially, three possibilities:

  1. That Google is aware of what it is doing and has tuned its algorithms to promote neo-fascist views (for reasons given above I do not believe this is currently the case);
  2. That the neo-fascist right, by superior intellect and cunning, have been able to game the Google ranking algorithms (for reasons I hope to give below I don't wholly believe this);
  3. That a combination of naivety of Google's algorithms and the structure of far-right websites has accidentally created the current mess. This is what I (mostly) believe, and I shall explain why.

(Note that today Google tweaked the search suggestions system so that it no longer offers the 'Are Jews evil' suggestion that Cadwalladr highlighted, but this is a solution only to the problem of that specific query: it isn't a general solution.)

How search works

People who are not computer scientists believe algorithms are inherently complex and magical. They're not. Most are very simple. Google's page-rank algorithm is now proprietary, and thus secret; it has presumably been tuned somewhat over the years since it was last open and public. But the core of it is extremely simple.

Previous search engines, like Alta Vista, had scored web pages based on the content of the page itself. Thus if a web page contained the word 'Jew' many times, these search engines would rank the page highly as a source of information about Jews (in this essay I am using 'Jew' and 'Jews' as an example of a word that has become gamed; this essay itself has nothing to do with Jews and says nothing either positive or negative about them). The more times the page repeated the word, the more highly it would be ranked for that word. This was based on the naive assumption that people writing web pages were honest, well intentioned, non-delusional and well informed. And as most people are honest, well intentioned, non-delusional, and don't write about subjects on which they're not well informed, for a while this algorithm worked well enough.

But it was extraordinarily easy to game. The search engine believed what a web page said about itself. The search engine reads the text of the page, not the formatted image (that's still true of Google today). So Alta Vista, although it would give higher weight to words that were in headings than to words that were in body text, gave the same weight to words which were, for example, in white on a white background (and which therefore a normally sighted human reader using a normal browser wouldn't see) as to words which were black on a white background. 'Invisible' text could be inserted into pages - including as headings - which search engines would see but readers would not. Very often this invisible text would be a single repeated word: 'Jew Jew Jew Jew', or variations 'Jew Jewish Jews Judaism'.
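To make the weakness concrete, here is a minimal sketch of that kind of self-reported keyword scoring. It is not Alta Vista's actual code, which I have not seen; the function name, the page structure and the heading weight of 3 are all my own assumptions, and I've used 'reform' as a neutral example word. The point to notice is that the scorer only ever sees text, so it cannot tell visible words from invisible ones.

```python
import re
from collections import Counter

HEADING_WEIGHT = 3  # assumed value: headings count for more than body text


def count_term(text, term):
    """Count whole-word occurrences of term in text, ignoring case."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words)[term.lower()]


def keyword_score(page, term):
    """Score a page purely on what the page says about itself.

    The page is a dict with a list of 'headings' and a 'body' string.
    Nothing here knows whether the text is visible to a human reader.
    """
    heading_hits = sum(count_term(h, term) for h in page["headings"])
    body_hits = count_term(page["body"], term)
    return HEADING_WEIGHT * heading_hits + body_hits


honest_page = {
    "headings": ["Land reform in Scotland"],
    "body": "An account of land reform and why reform matters.",
}

# A page about something else entirely, stuffed with repeated words
# (which in a browser might be white text on a white background).
stuffed_page = {
    "headings": ["reform reform reform"],
    "body": "Buy cheap pills. " + "reform " * 50,
}

honest_score = keyword_score(honest_page, "reform")
stuffed_score = keyword_score(stuffed_page, "reform")
```

Under these assumptions the stuffed page easily outscores the honest one, which is exactly the failure described above.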

Google's insight was that what a page says about itself is not trustworthy, and that what other, unrelated sites say about a page is more trustworthy. The Web is a collection of linked pages; rather than counting the words on a page, Google counted the words in links to the page. So if a page contained the word 'Jew' a hundred times, Google (unlike Alta Vista) would not be more likely to treat that page as an authoritative source of information on Jewishness than if it did not contain the word 'Jew' at all. But if pages on a hundred other sites - that is, sites with a different domain name - all had links to the page, and all those links contained the word 'Jew', then Google would rank the page highly as a source of information on Jews. The greater the number of such links, the higher Google would rate it.
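A rough sketch of that idea, as I understand it, follows. This is emphatically not Google's algorithm, which is secret; the function name, the tuple layout of a link and the choice to count each linking domain only once are my assumptions, and 'reform' again stands in as a neutral example word.

```python
from urllib.parse import urlparse


def anchor_score(links, target_url, term):
    """Count distinct other domains that link to target_url with term in the link text.

    Each link is a (source_url, link_text, destination_url) tuple.
    Links from the target's own domain are ignored: what a site says
    about itself is not treated as evidence.
    """
    target_domain = urlparse(target_url).netloc
    linking_domains = set()
    for source_url, link_text, destination_url in links:
        if destination_url != target_url:
            continue
        source_domain = urlparse(source_url).netloc
        if source_domain == target_domain:
            continue
        if term.lower() in link_text.lower().split():
            linking_domains.add(source_domain)
    return len(linking_domains)


target = "http://target.example/page"
links = [
    ("http://a.example/post", "land reform explained", target),
    ("http://b.example/x", "reform", target),
    ("http://a.example/other", "reform again", target),    # same domain as the first link
    ("http://target.example/self", "reform reform", target),  # self-link, ignored
    ("http://c.example/y", "click here", target),            # word not in the link text
]

score = anchor_score(links, target, "reform")
```

Here only two independent domains vouch for the page on the word 'reform', however many times the page or its own site repeats the word.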

People, on the whole, are more likely to link to sites they agree with than to sites they disagree with. So for example, I create a lot of links to stuff by Andy Wightman, Lesley Riddoch, Cat Boyd, Vonny Moyes. Different communities of interest also use different vocabularies. So for example if you type 'land reform' into Google you'll get a set of results broadly favourable to land reform; if you type 'land grab' you'll get a set of results broadly unfavourable towards land reform. The reason is simple: those who oppose reform are much more likely to frame it as 'grabbing'.

So we have a situation in which a page which is linked to by a very large number of other pages with the word 'Jew' in the link text is rated highly as a source of information about Jews, and it happens that the majority of pages which use the word 'Jew' in link text use those links to point towards anti-semitic pages; and thus Google, using its very simple algorithm of counting the links which contain the word, treats those anti-semitic pages as authoritative about Jews. Google isn't being evil; it's simply being naive.

The question is why it happens that the majority of pages which use both the words 'Jew' and 'Evil' in links point to anti-semitic sites. Originally, I'm pretty sure it was happenstance. Thousands of rabid mouth-frothers created thousands of links on thousands of blogs, all using the word 'Jew'. Ordinary serious Jewish people, writing about Judaism, probably don't use the word 'Jew' very often, because in their discourse Jewishness is assumed; and in particular they're pretty unlikely to link it with the word 'evil', because people tend not to think of people within their own community as evil.

The Google game

But once this pattern emerges and is recognised, what happens? I can go out this morning and buy a hundred internet domains all with apparently unrelated names, all with a hundred apparently distinct registered owners. I can point those domains at servers I can hire cheaply in the cloud, and I can host a hundred different websites. On each of those websites I can host a page with a link with the text 'Jew', which points to a single, common page saying something negative about Jewishness. If I choose a page which is already fairly highly ranked on the word 'Jew', I can push it even further up the rankings.
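The effect of such a link farm is easy to show in a toy model. This sketch assumes, as a simplification of my own, that pages are ranked purely by the number of distinct domains linking to them; the domain names, page names and numbers are all invented for illustration.

```python
from collections import defaultdict


def rank_by_inbound_domains(links):
    """Rank pages by how many distinct domains link to them, most-linked first.

    Each link is a (source_domain, target_page) pair. Ties are broken
    alphabetically so the result is deterministic.
    """
    inbound = defaultdict(set)
    for source_domain, target_page in links:
        inbound[target_page].add(source_domain)
    return sorted(inbound, key=lambda page: (-len(inbound[page]), page))


# Links that arose naturally: page-a is somewhat better linked than page-b.
organic_links = (
    [(f"blog{i}.example", "page-a") for i in range(40)]
    + [(f"site{i}.example", "page-b") for i in range(30)]
)

# A hundred cheaply bought domains, all pointing at page-b.
farm_links = [(f"farm{i}.example", "page-b") for i in range(100)]

ranking_before = rank_by_inbound_domains(organic_links)
ranking_after = rank_by_inbound_domains(organic_links + farm_links)
```

Before the farm is added page-a ranks first; afterwards page-b does, and nothing in the simple count distinguishes the bought links from the honest ones.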

This is a scheme which has already been used for years by spammers and scammers; it would be a miracle if conspiracists had not noticed and begun to exploit it. So, as I wrote above, I believe that the current situation where innocent searches can lead to extreme or malicious material has arisen by accident as a result of naivety on the part of an essentially-reasonably-honest Google; but I also believe that it has now begun to be gamed.

But beyond that, the search suggestion system can be gamed. The search suggestion system is just an ordered list of the most common search queries. It has some slight tweaks, but that's essentially it. So if a million monkeys sit at a million keyboards and type 'are Jews evil' into a Google search all day, then 'are Jews evil' quickly rises up the suggestion list and starts to be the first thing offered by the suggestion system when someone innocently types 'are Jews'. Of course, those monkeys don't need to be real monkeys - a bot-net of hacked computers could easily be programmed to repeatedly ask Google particular questions, forcing other phrases up the suggestion list.
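A suggestion system of that simple kind can be sketched in a few lines. Again this is my own illustration under stated assumptions, not Google's code: the function name, the query log and the counts are invented, and the queries are deliberately neutral ones.

```python
from collections import Counter


def suggest(query_log, prefix, limit=3):
    """Return the most frequently logged queries that start with prefix."""
    counts = Counter(query.lower() for query in query_log)
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix.lower())]
    # Most common first; alphabetical order breaks ties.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


# What real people typed.
human_queries = (
    ["is land reform fair"] * 5
    + ["is land reform legal"] * 3
    + ["is land reform theft"] * 1
)

# What a bot-net typed, over and over.
bot_queries = ["is land reform theft"] * 1000

suggestions_before = suggest(human_queries, "is land reform")
suggestions_after = suggest(human_queries + bot_queries, "is land reform")
```

With only the human queries, 'is land reform fair' is offered first; once the bot traffic is mixed in, 'is land reform theft' heads the list, because a raw count cannot tell a thousand machines from a thousand curious people.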

Search in a capitalist society

I sat down yesterday evening to think: OK, how does international civic society work with Google to limit this problem, to algorithmically build a better notion of trustworthiness into the evaluation of links? Then I stumbled on an even more potent problem, one which is obvious once you have thought of it but very disturbing when you first stumble upon it.

We live in a capitalist society. Capitalism is disentropic of wealth; people who have wealth have opportunities to accumulate more wealth which are not available to people who don't have wealth. This is true at all scales; a home owner has more economic opportunities than a tenant, a millionaire than a home owner, a billionaire than a millionaire. In normal functioning, in a capitalist society, wealth is concentrated more and more into fewer and fewer hands, and the rate at which this concentration happens accelerates over time. There is a stark tension between this fact and the idea of fairness which appears to be innate in human beings, which even very small children can clearly articulate. Historically, there have been events when capitalism has reached crisis, when wealth has been radically redistributed from the very rich to the rest of society; the most recent of these was during and immediately following the Second World War.

But since then, the ratchet has been working quietly away again, as simple mechanisms will.

One of the things which happens when capitalism reaches crisis is the rise of the right. This isn't in the least bit accidental. People who are very wealthy wish by definition to remain very wealthy, since giving away wealth is easy. People with wealth can fund political campaigns, and political persuasion. It's no accident that, throughout the Western world, the bulk of mass media is owned, not by readers' or workers' co-operatives nor by civil society, but by individual plutocrats. It's no accident that very wealthy people stand for high office - and win.

'The immigrants are taking our jobs' is an explanation for the reason that employment is getting harder to find. But in an age of globalisation and automation, it's hardly a very persuasive one. There are alternative, more persuasive, explanations: the investor class has offshored our jobs; the technologists have automated them out of existence. Yet in the narrative surrounding both the Brexit vote in the United Kingdom and the Trump victory in the United States, it is accepted that a significant proportion of the vote was driven by xenophobia against immigrants.


Well, certainly one explanation is that the right amplified that message at the expense of the alternatives. And the reason the right should choose to do that is because the right represents the interests of capitalism's winners - those who have, by luck, chance, dishonesty, inheritance, or by any other means accumulated more than their equal share of the world's wealth, and who want to hold onto it. The 'offshore' and 'automation' narratives both place responsibility for the loss of jobs in western economies on the heads of the investor class which chooses where to place investments, and chooses which technology. The right seeks to shift responsibility for loss of jobs from the few powerful plutocrats to the many powerless migrants.

And the evidence is that they're succeeding, which is, tangentially, where we came in.

But the fact that the right is succeeding is not the horrible thought. On a level playing field we could counter the right's success in exploiting Google, either (which I would prefer) by working with Google to develop algorithms and architectures which would make it easier to assign a trustworthiness score to a link, or by creating a new left-oriented search engine, or by 'reverse gaming' the page rank algorithm, architecting a 'web of the left' to balance the existing web (whether accidental or designed) of the right.

But this isn't a level playing field.

There ain't no such thing as a free search

We don't pay for Internet search. We accept that search is, like so much else on the Internet, free to use. Of course, it isn't free to provide. To handle the billions of search requests Google receives each day, to run the spiders which continually explore the Web to keep search results up to date, to run the indexers which convert the pages collected by the spiders into ranked data that search responses can be collated from, takes a mass of hardware and an enormous quantity of bandwidth. But Google doesn't provide us with this free, rich search experience out of charity. It doesn't even provide it as a loss leader. On the contrary, it is the enormous profitability of search which cross subsidises Google's many more experimental ventures.

So how does free search convert into enormous profits? By building up a detailed picture of your interests in order to sell highly targeted advertising. To see what a search engine looks like without that revenue, look at Duck Duck Go. Duck Duck Go doesn't identify you, doesn't collect information on you, and doesn't sell the information that it doesn't collect to advertisers. It is also a commercial company, seeking to make profit from search. Instead of collecting data about you to sell on, it sometimes (but not often) shows adverts at the top of the search results.

Duck Duck Go is there, it works, it's relatively unintrusive. You could use it, but you don't. You don't use it partly because you know Google will find you what you want, and partly because you intuit (and, it happens, correctly) that the results will not be so good.

What you don't see is how up to date the results are. In a typical week, Googlebot - Google's spider - reads more than 500 pages from my personal website. In the same period, DuckDuckBot reads one. And that differential represents the difference in resources the two companies have. Google crawls websites based on their own metric of how often a site changes, but nevertheless they check most pages on my site most days; my site changes rarely. Sites which change more frequently are crawled more intensively. Google clearly has the resource to scan the whole web very frequently: search results from Google will always be very up to date. DuckDuckGo don't say how their spider searches, but nevertheless it's clearly much less often.

But there's more that DuckDuckGo can't do that we've come to expect Google to do for us. Because Google collects and stores a lot of information about us, it can tailor its search results to be relevant to us. It knows what I've searched for in the past, where I live, what car I drive, which websites I visit, what items I've shopped for but not (yet) bought. It can show me things it thinks will interest me, and a lot of the time it's right. DuckDuckGo cannot do this, because of a choice - arguably an ethical choice - its designers have made: they've chosen not to collect the data which would make those personalisations possible.

Who owns our searchers?

Google is a commercial company which makes enormous profits by collecting a great deal of information about its users so it can target advertising at them. I continue to believe that Google is on the whole a relatively ethical company. At least one of its founders thinks seriously about the ethics of what Google does, and while his ethical judgements are not always the same as mine (and, it seems to me, do not always win out, these days), I don't see the company as ethically vacuous in the way many are these days, still less actually evil. I believe that if we could show Google how to develop referrer quality metrics and integrate them into search, they would do this. I believe that we could work with Google to make it harder for political interests (including ourselves) to manipulate search results.

As long as their mission is "to organise the world’s information and make it universally accessible and useful", as long as that is a sincere statement, we can work with Google, because improving the perceived political neutrality of their search (to the extent that there is such a thing as political neutrality) improves the quality of the product.

But Google is a publicly listed company. It can be bought. And Google is not necessarily the world's most popular librarian for ever; Facebook is coming up fast behind, and there's no pretence that Facebook is an ethical company. We cannot trust that the places people go to find information on the Web will be benevolent. On the contrary, like big media, they are likely to become targets for people - very wealthy people - who wish to influence public opinion, just as the major newspapers and television channels have been.

Google has restructured itself to be part of a new group, called Alphabet (although Google is by far the largest and most profitable company in that group). Alphabet's market valuation is something more than half a trillion US dollars. That's about equal to one third of the combined total wealth of the poorer half of the world's population. The poor cannot buy Google, or anything like Google. The left cannot buy Google. But as few as ten of the world's richest people could club together and buy Alphabet. It would be a good investment. It's still very profitable.

And, of course, many of the world's richest people are (very) right wing.

The library of lies

Control information - control the information it is possible to search for, possible to discover - and you control thought. Heterodox ideas - heresies - can be made unfindable. Books need not even be burned; they can simply be hidden, bowdlerised, altered; false, perverted copies can be produced as the real thing. False 'news' can be mixed with true until the two become indistinguishable, as has already begun to happen to readers of some newspapers and viewers of some television channels.

People discover the Web largely through search. It does not matter how much true information, how many clear and logical expositions of interesting heterodox opinion there are out there on the Web, if search - the search we choose to use - does not find it for us. Network effects mean that at any one time one search engine will dominate search - the biggest search engine has the most resources, so can be most up to date and responsive, so everyone uses it - why would you use anything else? Thus Alta Vista supplanted Lycos and Google supplanted Alta Vista. Possibly someone will come up with an algorithm so much better than Google's that they will sweep Google from the Web; more likely, companies like Facebook and Apple will fragment the Web into separate walled gardens in which they control search, and into which they don't allow third party spiders.

But whether Google remains king of the hill, or whether it is supplanted, the politically ambitious rich must now be eyeing search in the same way that fifty years ago they viewed broadcasting and a hundred years ago they viewed newspapers. Control information, and you control thought. And the means by which people access information, in a capitalist economy, can be bought.

Yes, I believe that the left - and civil society generally - could work with Google to create 'politically neutral' search, for some value of politically neutral. We could because, I believe, Google is at its core still a well-intentioned company. But in a future - a future I think under capitalism more or less inevitable - in which search is owned by people like the owners of the Daily Mail, the owners of Fox News, could we then work towards 'politically neutral' search?

Well, only to the extent that Fox News is now politically neutral television.

Look to windward.

The fool on the hill by Simon Brooke is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License