Thursday, November 13, 2008

How Leaders Create and Use Networks

How Leaders Create and Use Networks (pdf)

> She resolved to make time for an extracurricular passion—the theater—in a way that would enhance her business development activities. Four times a year, her secretary booked a buffet dinner at a downtown hotel and reserved a block of theater tickets. Key clients were invited. Through these events, Linda not only developed her own business but also learned about her clients’ companies in a way that generated ideas for other parts of her firm, thus enabling her to engage with colleagues.

> They network in order to obtain information continually, not just at formal meetings.

> they wait until they need something badly. The best networkers do exactly the opposite: They take every opportunity to give to, and receive from, the network, whether they need help or not.

Thursday, August 21, 2008

Shall our research agenda revolve around Positive Psychology?

Positive psychology is a recent branch of psychology that studies why some people feel particularly happy. "Positive psychologists seek to find and nurture genius and talent," and "to make normal life more fullfilling, not to cure mental illness" (wikipedia). The father of positive psychology, Martin Seligman, gave a talk at TED.

In his research, he has found that happy people differ from miserable ones in one respect: they have pleasant, good, and miningful lives (find below a rigorous description of what he means by pleasant, good, and miningful). He councluded his talk (at minute 20:00) by proposing to further research on how technology may make our lives pleasant, good, and miningful. Shall we enrich our research agenda? :-) It would be a good start to read Learned Optimism or Stumbling on Happiness.

-------------------------------------------

. Pleasant Life or the "life of enjoyment" examines how people optimally experience, forecast, and savor the positive feelings and emotions that are part of normal and healthy living (e.g. relationships, hobbies, interests, entertainment, etc.).

. Good Life or the "life of engagement" investigates the beneficial affects of immersion, absorption, and flow that individuals feel when optimally engaged with their primary activities. These states are experienced when there is a positive match between a person's strength and the task they are doing, i.e. when they feel confident that they can accomplish the tasks they face.

. Meaningful Life or "life of affiliation" questions how individuals derive a positive sense of well-being, belonging, meaning, and purpose from being part of and contributing back to something larger and more permanent than themselves (e.g. nature, social groups, organizations, movements, traditions, belief systems).

Sunday, June 15, 2008

future: apple-controlled mobile browsing

Imagine that iPhone 2.0 attains the kind of market penetration currently enjoyed by the iPod music player. ... We will have sleep-walked into a different world. How come? Because whereas the personal computer is an open, user-configurable device, the iPhone is a decidedly closed one. Nothing runs on it other than software approved by Apple. This is not because the iPhone is incapable of running complex software: on the contrary, it is, in fact, a powerful Unix computer. But users who treat it as such - for example, by installing 'unofficial' software on it - run the risk of having their device 'bricked' [disabled] the next time they synchronise it with the iTunes software on their PCs.

Of course anyone can write programs for the iPhone with the aid of Apple's (free) Systems Developers Kit. But the only way they can get them installed will be via Apple's 'App Store'. And nothing will get into the store unless it's been approved by Apple.

If Apple's strategy succeeds, an increasing proportion of internet users will access through a gateway entirely controlled by a single company.

From guardian.

Friday, June 6, 2008

Challenges & Drivers for Mobile web 2.0 applications

This whitepaper categorizes mobile web 2.0 applications by three verbs - share, collaborate, exploit. For example, in location-based services, users share their location with others, collaborate with those narby and exploit local knowledge.

Challenges
* Telcos control the distribution channel and content on it.
-> It's impossible for small start-ups and developers to innovate
* Flat-rate data pricing undermines revenues
* Don't exist performance analytics for advertisers

Drivers
* Offer new ways to deliver ads
* Anywhere/anytime accessibility of mobile phones
* Convergence of mobile & web worlds
* Flat-rate pricing (cheaper and more transparent for customers)
* If operators' portals will be open --> each operator can search beyond its portal

Tuesday, June 3, 2008

Mobile Cultures

It is tempting to look at Japan (and Tokyo in particular) as a vision of the future for other countries or cities. ... However, there are more fundamental aspects of a society that cannot be disentangled from their attitudes and integration of technology.

Firstly, Tokyo is all about public transport. Trains run at minute intervals and are packed. Public transport is prime info-snacking space.

Secondly, population density in Tokyo results, obviously, in small accommodations. One consequence of this is that people, and young people in particular, spend a lot of time outside". More.

Branding Yourself With A Blog

" Certainly personal branding isn't a new concept, but the future of personal branding could be in at your fingertips with a blog. One of the first steps in creating a brand for yourself is to make your blog visible. Post meaningful entries, comment on your industry's top blogs, or simply gain a regular readership. Visibility creates opportunities" says Schawbel, a social media specialist at EMC Corporation. He believes that when you brand yourself, the competition becomes irrelevant. "The goal of personal branding is to be recruited based on your brand, not applying for jobs," Schawbel says. "

Friday, March 7, 2008

Seeing Our Signals: Combining location traces and web-based models for personal discovery

Here is the paper.

By combining web applications and mobile phones, they are developing algorithms for implementing a Personal Environmental Impact Report (PEIR). PEIR is a mechanism for longitudinal documentation of both impact—what an individual does to the environment—and what the environment does to the individual, or exposure.

Eigen-Trend: Trend Analysis in the Blogosphere based on Singular Value Decompositions

Here is the paper.

Situation: The blogosphere provides free large-scale information sources from which businesses can quickly learn trends (e.g., opinions and complaints from their customers). How? For example, by keeping track of how often relevant keywords are mentioned across blogs (summing up the occurrence of keywords).

Complication: This simple way of extracting trends has tree problems:
1) Different blogs contribute to the trend differently.
2) For the same keyword, different groups of blogs may have different interests.
3) Can we directly study and extract meaningful trends from such a dynamically changing blog graph structure? The blogosphere can be considered as a blog graph where the nodes are blogs and the links reflect endorsements and interactions among blogs. In addition, such a blog graph is changing with time as a result of the development of internal relationships (e.g., interactions among blogs) and external events (e.g., breaking news).

Solution: The authors propose eigen-trends, temporal indicators derived through singular value decomposition, that take differences among individual blogs in consideration. The key idea is to represent the observed data as a combination of information that captures temporal changes of the underlying data (i.e., eigen-trends) and information that captures the characteristics of individual bloggers (e.g., authority).

Boosting Topic-Based Publish-Subscribe Systems with Dynamic Clustering

Great paper.

Problem: The maintenance overhead of a topic-based pub-sub system becomes particularly dominating when the system supports a large number of topics with moderate event frequency (e.g., news syndication scene).

Fact: One can typically detect correlations between users subscriptions, which can be used to group topics and reduce the overall maintenance cost. For instance, users that are subscribed to updates for a given piece of software (say the free bitmap image editor GIMP [25]) are likely to be also subscribed to updates of other software pieces on which the given software depends (e.g. GTK+, libart and Pango [9]).

Proposal: A dynamic distributed clustering algorithm that

+ utilizes correlations between user subscriptions to dynamically group topics together, into virtual topics (called topic-clusters).

+ continuously adapts the topic-clusters and, resp., the user subscriptions, to the changing system state by employing local cluster updates. Each local update is performed only when it is estimated to be (globally) cost effective. Furthermore, to minimize the overhead involved in gain estimations, a probabilistic component is employed to guarantee that (with high probability) gain estimation are computed only for updates that are likely to be beneficial.

Wednesday, February 27, 2008

Pecha kucha in Tokyo



I went tonight, and it was really cool.

"We are all familiar with death by PowerPoint (one bullet at a time). Pecha kucha aims to cure this by limiting PowerPoint presentation to 20 slides, 20 seconds per slide. A total experience of just 6 minutes, 40 seconds. The idea is to build compelling presentations around powerful images and simple messages. "

Old article on Wired, and here are 3 presentations done worldwide:



Thursday, February 21, 2008

(try to) fly what you build!

Patrick (my ex-flat mate in Tokyo) & his colleagues have designed a flight simulator in which you build your own plane. Watch it or give it a go.

Wednesday, February 20, 2008

State vs. Religion

"Secular schools can never be tolerated because such schools have no religious instruction, and a general moral instruction without a religious foundation is built on air; consequently, all character training and religion must be derived from faith ...we need believing people."

This poorly-researched book makes points similar to the quote above, which is by ...


Sunday, February 3, 2008

web 2.0

Xavier Comtesse's matrix:
1- Knowledge
-- raw data
-- data with context
-- data with context over time that leads to
-- modelisation and then to
-- forecast

2- User involvement in the value chain:
-- receive info
-- pick and choose (self service)
-- customize (Do-it-yourself)
-- co-design
-- co-creation

Thursday, January 24, 2008

HOWTO: Avoid common problems when writing a conference paper

Tips by Peter Gutmann.

Every year I serve on the program committees of a number of computer conferences, which means that I get given a pile of papers to review for publication. Every year I see the same problems with submitted papers. And every year I decide I should write down some guidelines on things to look out for when writing a computer conference/journal paper. This year I've finally got around to it... The following is a list of the problems that most commonly crop up in submitted papers. Treat this as a checklist to apply to the paper you're planning to write, or have already written.

Existing Material -----------------
Providing a summary of existing work in the area covered by your paper is extremely important. This serves two purposes, firstly it lets the reader know that you're familiar with the field, and more importantly it lets you know whether what you're doing is useful and/or original or not. I don't know how many papers I've received that I've had to reject with a comment like "This is a poorly-done reinvention of a technique from the 1980s" (and no matter what topic your paper is on, at least one referee is going to tell you that it's a reinvention of something that Multics was doing 40 years ago). Many authors simply plough ahead with a topic without (apparently) performing any checking to see whether anyone has done this before. Others do a quick check with Google and leave it at that, however online archives typically only reach back a decade or so and are unlikely to tell you that you're reinvented something that first appeared four decades ago.

I can't over-emphasise how important this first step is - I've seen fifteen- to twenty-page papers killed in an instant because the authors never checked to see whether someone else had had this idea before them.

Approaching the Problem -----------------------
Experimental design can be a bit of an art form. As the saying goes, a problem well stated is a problem half solved. Although it's relatively rare to find the kind of biased metrics that are traditional in competitive benchmarking (when the goal is to prove that your system is better than the competition's), it's unfortunately too common to find problems like an inappropriate level of detail (either too much or too little), or poor or even no analysis.

The level-of-detail issue is an important one to get right. You want to present enough information that others can repeat the experiment, but not enough detail to overwhelm the reader. At the too-little-detail end of the scale, I've seen papers with results that would be impossible for anyone else to reproduce, either because the raw data isn't available (a relatively common problem that others have complained about), or because data acquisition details are insufficiently specified, or because the data-to-information transformation step(s) are insufficiently explained. The goal of a scientific process is repeatability and verifiability. This can be particularly important if the results are unusual or controversial. In a psychology paper published in the late 1990s with rather astonishing results the authors took the unusual step of having the work repeated by independent teams in the US and Europe, as well as making the raw data available for others to use (many did, and got the same results as the original authors). Without this level of detail, the very next research result could invalidate yours, and without repeatability no-one can demonstrate that you were correct.

At the other end of the scale are the papers that are decorated with various distractions like snippets of source code that caught the authors' attention and long-winded discussions of possibly interesting but not terribly relevant minutiae. Yes, it's interesting stuff, but it detracts greatly from the main flow of the work. If you're going to do this, put it in a separate section so that it doesn't distract from the main body of the work.

One other thing that you need to do is state your assumptions in advance. This is particularly important for papers in areas like networking and security, where the lack of a proper network performance model or threat model can make a paper almost meaningless ("assume a perfectly spherical elephant of negligible mass and volume"). In theory this could help kill a paper - I've seen security papers where the threat model is taken straight from la- la land - but in practice this doesn't seem to be much of a problem. Unrealistic models are a sure sign that your workflow was { design a cool algorithm, invent an artificial problem for it to solve }. A friend once told me about a paper he'd co-authored on a neat algorithm for which they had to invent an artificial problem for the algorithm to solve. The next year the same conference where they'd published it received three more papers on solutions to this non-problem (it probably has its own research field by now). So you can usually get these things published if you're careful about which conference you submit to. On the other hand if I'm refereeing something I'd prefer to see an honest assessment of the algorithm's value ("it's a cool algorithm and we wanted to share the design with others") rather than some totally bogus attempt at a justification for publishing.
While an artificial network or threat model isn't a killer, lack of any model at all is, because people are going to come up with situations that your design doesn't handle and tell you to resubmit once you've addressed them. Providing a model frames your solution and lets the reader know what it is that you're trying to achieve. It also means that reviewers will try and pick holes in your model rather than your design. (An unfortunate side- effect of this is that you can then design completely unrealistic models, see the previous paragraph).

Analysis of Results -------------------
One thing that can completely kill a paper is lack of analysis of the results. I've seen otherwise fine papers die in review because the "analysis" consists of the authors stating what's obvious from the graphs they present ("foo increases linearly") but never explain why foo increases linearly, or whether foo increasing linearly is a good or bad thing. More scary are papers where highly suspicious results (for example an O(n) and O(n^3) algorithm having identical runtimes) pass unnoticed. This is an indication that either the authors have been careless or that they don't understand the data that they're looking at, neither of which provide much confidence in a scientific paper.

When you present your results, you need to provide an analysis of why you obtained those results, not just a voiceover for the graphs and diagrams. If you have an O(n) algorithm but are seeing identical runtimes for n=100, n= 1000, and n=10000, you'd better have a good explanation for this. More generally, any significant anomaly in your results will need either further investigation or an explanation. One of the purposes of a paper is to demonstrate (and convey to readers) an understanding of what you're doing. Multiple unexplained anomalies can lead to the impression that this understanding is absent in the paper.

Even worse than an unexplained anomaly is a totally unnoticed anomaly. I once read a paper on crypto performance measurement that compared (among other things) the performance of DES with triple DES, where triple DES is DES iterated three times. The execution times for DES and triple DES were identical (!!), but the author's hadn't appeared to notice this.

A feature of many network protocols (and underneath the protocols, network hardware) is the extensive cacheing and buffering performed whenever possible to reduce latency and speed throughput. One conference received a paper whose results showed that the SSL/TLS handshake was about a thousand times faster than the IPsec IKE handshake. Now despite the best efforts of the IPsec design committee, the probably didn't manage to make IKE more than a few times slower than SSL/TLS. What had happened was that SSL has the capability to perform a re-handshake using cached data, and what the paper was actually measuring was the overhead of n IPsec handshakes vs. 1 SSL handshake and n-1 cached reconnects.

Another thing that's important when presenting results of measurements is to explain (and demonstrate understanding of) exactly what it is that you're measuring. The two protocols mentioned above, IPsec and SSL, both have numerous variations in their crypto mechanisms that can lead to vastly different performance results. For example the IKE handshake was originally designed by mathematical cryptographers who assumed that all crypto operations had an O( 1 ) cost. As a result, the original IKE public-key handshake design required no less than four (!!) very expensive private-key operations, compared to zero for the SSL client and one for an SSL server.

It's important in areas like this to both describe explicitly which types of operations your results are presenting and to either measure the standard, default operating mode or if you're using variant modes to make it clear to the reader that the results are atypical and not necessarily representative of normal performance.

(There's a lot more to say about this area such as techniques for analytical modeling, how to handle outliers, sensitivity analysis, and so on. This is covered in a number of standard texts, of which the most extensive is probably Raj Jain's "The Art of Computer Systems Performance Analysis". You also need to adjust for things like clock granularity, the overhead of clock sampling, influence from external factors like other network traffic or CPU load, and so on. For example relying on the operating system's ability to generate distinct timing data on a per-process level isn't going to help if another process is thrashing the CPU cache, which no amount of process- granularity timing is going to fix).

In terms of statistical analysis of results, papers tend to fall into two categories: Either very little, or way too much. Very little can work (it's an O(n) algorithm, end of story). The other end of the scale is rather more problematic. The result is a paper that's only comprehensible to another statistician, a variation of the "much data, little analysis" problem mentioned earlier. Applying Noodleheinz's strategy for quantum normalisation to yield a correlation factor of 0.167 with a variance of 7.31 to a pile of data may be a fine demonstration of your grasp of statistics (or perhaps a demonstration of your lack of grasp of statistics), but it may as well be written in Klingon for most of the audience (unless your audience consists of the attendees of a conference on network performance analysis). If you're going to apply statistical techniques, explain what you're doing and explain the significance of the results so the reader can understand it as well.

Diagrams and Tables -------------------
Some papers are severely let down by the quality of the graphs and tables that they contain. There have been plenty of books written on the graphical representation of data, so I'll just go through the main problem areas here. The three biggest offenders that I've seen are inappropriate use of graphs, inappropriate scales, bad labeling, and bad choice of units.

In a number of cases graphs are used where none are necessary. Graphing white noise is an extreme example of this. Graphs are intended to show a trend in data. If there is none, use a bar graph or a table, not a line graph. Similarly, if your graph has less than (say) five data points, use a table, or just put the data inline in the text.

When you're presenting information on a graph, don't try and overload it. In almost all cases a graph should present one set of y-variables (and by extension one set of x-variables). In some very special cases you can use two sets of y-variables, but you'd better have a very good reason for this. Any more than this and you're making your graph incomprehensible. In addition, don't try and cram more than five or six curves onto a graph, or you run into the same problem.

Use of appropriate scales results in data being crammed into one corner on one side of a graph, or all detail for 99% of the data being lost through an attempt to fit a single anomalous spike into the result. Look at your graph. If 90% of it is taken up by empty real estate, you need to consider an alternative way of representing it. Conversely, if 90% of it is taken up by ink, you've got a similar problem. A good rule of thumb is to try to maximise the information-to-ink ratio in your graph. There's something called the three-quarter-high rule which says that the highest point on the graph should be at least three quarters of the horizontal offset of the rightmost point, which is meant to avoid biased representations when using techniques like nonzero origins of the axes, but can lead to odd-looking graphs (I'll leave it up to you whether you want to use this or not).

Another thing to be aware of is that when plotting data resulting from random quantification of performance (so that if you were to repeat the experiment the results wouldn't be exactly identical), you should really show confidence intervals in your graph. People often don't do this and it doesn't seem to cause any problems, but you should at least mention the fact in the accompanying text if what you're measuring requires it. Obviously if it's some completely deterministic process there's no need to do this, and doing so will just clutter the graph or text.

Bad labeling is another problem, and can take many forms. I've seen text crammed in at odd angles, complex graphs with data represented in near- indistinguishable shades of grey, graphs where it wasn't obvious whether large values were better or worse, graphs where the axes were barely labeled (10 minutes? 10 K/s? 10 MB?), graphs where the units aren't obvious ("Time" and a scale of 1...10, which could be nanoseconds, minutes, days, or years), and everything in between. Remember, graphs are intended to visually represent trends in data. If this isn't immediately obvious from your graph, redesign it or replace it with a table.

Finally, something that besets both graphs and tables is the inappropriate choice of scaling units, so that everything ends up with strings of leading or trailing zeroes. Go with the most common scaling unit. Even if "seconds" is a nice basic time unit, if all of your graph labels or table entries are coming out as "0.00000x s" then it's time to consider using microseconds and not seconds.

Where's the Rest of it? -----------------------
A problem that crops up in some papers is that the authors construct an (often elaborate) demonstration and analysis of a problem without really providing a solution. If the problem is of the grand-challenge type (for example breaking AES encryption) then this is fine. Showing how to do this is enough to get you worldwide attention, and fixing it becomes someone else's problem. On the other hand if the problem is a TCP/IP congestion issue under certain network conditions then simply demonstrating that there's a problem isn't sufficient. Even providing a general solution ("congestion control is Good") isn't sufficient. If your paper has pointed out that there's a problem, you also need to provide the fix for the problem. The contribution in an AES- breaking paper is to demonstrate that AES is broken, but the contribution in a general paper is the solution to the problem, not the problem itself.

Referee Feedback ----------------
There's a reason why conferences and journals provide detailed feedback from referees rather just a credit-card style "accepted/declined". Someone once submitted a paper to a conference that duplicated (somewhat poorly) work that others had done a few years earlier (see "Existing Material" above). I sent it back with a reference to the earlier work and a longish explanation of how they needed to provide significant new material in order to make it suitable for publication. A few months later they submitted the same paper, unchanged, to another conference. Since they hadn't bothered to update the paper, I didn't bother to update my comments, and sent it back with exactly the same feedback. A few months later they tried again at a third conference, and the same thing happened. I guess eventually they found a conference obscure enough that there was no referee present who recognised that it was a poor duplication of existing work and got it published there.

The point here is that the referees aren't just providing comments because they want to exercise their typing skills. If you get a paper back with comments that there's a serious problem with it then there (usually) really is a serious problem with it that needs to be addressed. Even if it's quite obvious that the referees are idiots and don't understand your work, they are (hopefully) representative of your audience as a whole and if the referees don't understand it then it's likely that the audience won't either (although obviously there are exceptions - a paper on the wonders of X.509 PKI submitted to an OSS conference full of SSH and PGP users, or a paper on SSH's key- continuity key management submitted to an X.509 PKI conference, is going to be a tough sell no matter how well it's written).

In general though trying a different set of referees isn't going to change anything because they'll find the problem as well. Sure, you can eventually get around it by finding a venue so obscure that no-one will notice, but if you're doing it for the publication credits are you really going to get much value from being listed in the appendix to the apocrypha of the Proceedings of the Bratislava Philological Society?

How to sell sex


The Economist: Economists let some light in on the shady market for paid sex.

Steven Levitt (an economics professor at the University of Chicago and co-author of “Freakonomics”) presented preliminary findings from a study conducted with Sudhir Venkatesh, a sociologist at Columbia University. Their research on the economics of street prostitution combines official arrest records with data on 2,200 “tricks” (transactions), collected by Mr Venkatesh in co-operation with sex workers in three Chicago districts.

The results are fascinating. Almost half of the city's arrests for prostitution take place in just 0.3% of its street corners. The industry is concentrated in so few locations because prostitutes and their clients need to be able to find each other.

Prostitutes are more likely to have sex with a police officer than to be arrested by one.

Sex without a condom is the norm.

One controversial finding is that prostitutes do better with pimps—they work fewer hours and are less likely to be arrested by the police or preyed on by gang members.

In many respects, the paid-sex industry is much like any other business. Pricing strategies are familiar from other settings. Despite evidence of a myopic attitude towards risk, there have been plenty of recent examples of that in the finance industry too. Illegality and lack of regulation are likely to heighten public-health risks. The Ecuador study concluded that rigorous policing of street prostitution might limit the spread of STIs by directing sex workers into the safer environs of licensed brothels.

Wednesday, January 23, 2008

Top 10 Data Mining algorithms

The top 10 algorithms are: C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes, and CART.

Tuesday, January 8, 2008

Top Ten Best (and Worst) Communicators of 2007


Here.
To Mike Huckabee (first in the list): barack will win bro!

Bill rocks!

For the first time, Bill Gates is far cooler than Steve Jobs! Ironically, he is so during his last day at Microsoft: