Time for More Changes

This isn’t the first time I’ve changed blogging platforms, and it probably won’t be the last. I got tired of having to do maintenance on a blogging platform, so I decided to look for something lightweight. Enter Jekyll.

Jekyll is basically a static website compiler – it takes templates and content and produces static HTML output. No databases, no runtimes, no attack surface (beyond a static webserver). Given that I don’t mind writing in Markdown (in fact, I was using a Markdown plugin for Mezzanine), it seemed like a perfect fit. I wrote a quick script to get content out of Mezzanine/Django and export as HTML/Markdown, then spent some time tweaking the settings and theme (based on Hyde).

I’m setting a goal to blog at least once a week, so watch this space for updates. And if you notice something odd going on (I know there are some issues with old posts and code blocks), please email me or ping me on Twitter.


Offensive Security Certified Professional

It’s been a little bit since I last updated, and it’s been a busy time. I did want to take a quick moment to update and note that I accomplished something I’m pretty proud of. As of Christmas Eve, I’m now an Offensive Security Certified Professional.

OSCP Logo

Even though I’ve been working in security for more than two years, the lab and exam were still a challenge. Given that I mostly deal with web security at work, it was a great change to have a lab environment of more than 50 machines to attack. Perhaps most significantly, it gave me an opportunity to fight back a little bit of the impostor syndrome I’m perpetually afflicted with.

Up next: Offensive Security Certified Expert and Cracking the Perimeter.


CSAW Quals 2015: Sharpturn (aka Forensics 400)

The challenge text was just:

I think my SATA controller is dying.

HINT: git fsck -v

The challenge also included a tarball containing a git repository. If you ran the suggested git fsck -v, you’d discover that 3 objects (all blobs, as it turns out) were corrupt:

:::text
Checking HEAD link
Checking object directory
Checking directory ./objects/2b
Checking directory ./objects/2e
Checking directory ./objects/35
Checking directory ./objects/4a
Checking directory ./objects/4c
Checking directory ./objects/7c
Checking directory ./objects/a1
Checking directory ./objects/cb
Checking directory ./objects/d5
Checking directory ./objects/d9
Checking directory ./objects/e5
Checking directory ./objects/ef
Checking directory ./objects/f8
Checking tree 2bd4c81f7261a60ecded9bae3027a46b9746fa4f
Checking commit 2e5d553f41522fc9036bacce1398c87c2483c2d5
error: sha1 mismatch 354ebf392533dce06174f9c8c093036c138935f3
error: 354ebf392533dce06174f9c8c093036c138935f3: object corrupt or missing
Checking commit 4a2f335e042db12cc32a684827c5c8f7c97fe60b
Checking tree 4c0555b27c05dbdf044598a0601e5c8e28319f67
Checking commit 7c9ba8a38ffe5ce6912c69e7171befc64da12d4c
Checking tree a1607d81984206648265fbd23a4af5e13b289f83
Checking tree cb6c9498d7f33305f32522f862bce592ca4becd5
Checking commit d57aaf773b1a8c8e79b6e515d3f92fc5cb332860
error: sha1 mismatch d961f81a588fcfd5e57bbea7e17ddae8a5e61333
error: d961f81a588fcfd5e57bbea7e17ddae8a5e61333: object corrupt or missing
Checking blob e5e5f63b462ec6012bc69dfa076fa7d92510f22f
Checking blob efda2f556de36b9e9e1d62417c5f282d8961e2f8
error: sha1 mismatch f8d0839dd728cb9a723e32058dcc386070d5e3b5
error: f8d0839dd728cb9a723e32058dcc386070d5e3b5: object corrupt or missing
Checking connectivity (32 objects)
Checking a1607d81984206648265fbd23a4af5e13b289f83
Checking e5e5f63b462ec6012bc69dfa076fa7d92510f22f
Checking 4a2f335e042db12cc32a684827c5c8f7c97fe60b
Checking cb6c9498d7f33305f32522f862bce592ca4becd5
Checking 4c0555b27c05dbdf044598a0601e5c8e28319f67
Checking 2bd4c81f7261a60ecded9bae3027a46b9746fa4f
Checking 2e5d553f41522fc9036bacce1398c87c2483c2d5
Checking efda2f556de36b9e9e1d62417c5f282d8961e2f8
Checking 354ebf392533dce06174f9c8c093036c138935f3
Checking d57aaf773b1a8c8e79b6e515d3f92fc5cb332860
Checking f8d0839dd728cb9a723e32058dcc386070d5e3b5
Checking d961f81a588fcfd5e57bbea7e17ddae8a5e61333
Checking 7c9ba8a38ffe5ce6912c69e7171befc64da12d4c
missing blob 354ebf392533dce06174f9c8c093036c138935f3
missing blob f8d0839dd728cb9a723e32058dcc386070d5e3b5
missing blob d961f81a588fcfd5e57bbea7e17ddae8a5e61333

Well, crap. How do we fix these? The good news is that the git blob format is fairly well documented. The SHA-1 of a blob is computed over the string blob, a space, the length of the blob contents as an ASCII-encoded decimal value, a null character, and then the blob contents themselves: blob <blob_length>\0<blob_data>. The object as written in the objects directory of the git repository is the zlib-compressed version of this same string. This leads us to these useful functions for reading, writing, and hashing git blobs in Python:

#!python
import hashlib
import sys
import zlib


def git_sha1(blobdata):
    # Hash the header "blob <length>\0" followed by the raw contents.
    header = b'blob %d\0' % len(blobdata)
    return hashlib.sha1(header + blobdata).hexdigest()


def read_blob(filename):
    # Loose objects are zlib-compressed binary data, so open in binary mode.
    with open(filename, 'rb') as fp:
        raw = zlib.decompress(fp.read())
    metadata, data = raw.split(b'\0', 1)
    _, size = metadata.split(b' ')
    size = int(size)
    if len(data) != size:
        sys.stderr.write('Metadata shows %d bytes, data is %d.\n' %
                (size, len(data)))
        sys.stderr.flush()
    return data


def write_blob(filename, blob):
    with open(filename, 'wb') as fp:
        fp.write(zlib.compress((b'blob %d\0' % len(blob)) + blob))

We’ll use these functions (read_blob, write_blob, and git_sha1) to fix each of the blobs in turn, but to do that, we need to figure out which blobs are busted and how to fix them. Using the combination of git log and git ls-tree, we can list the sharp.cpp blob for each commit, oldest first:

git log --oneline | tac | awk '{print $1}' | while read commit ; do git ls-tree $commit ; done | grep sharp.cpp
100644 blob efda2f556de36b9e9e1d62417c5f282d8961e2f8	sharp.cpp
100644 blob 354ebf392533dce06174f9c8c093036c138935f3	sharp.cpp
100644 blob d961f81a588fcfd5e57bbea7e17ddae8a5e61333	sharp.cpp
100644 blob f8d0839dd728cb9a723e32058dcc386070d5e3b5	sharp.cpp

So, the 3 broken blobs are, in order: 354ebf3, d961f81, and f8d0839. We can use git cat-file blob <id> to see the contents of each and look for obvious corruption. Doing this to the first file, we see a valid C++ file, with no syntactic corruption.

#!c++
#include <iostream>
#include <string>
#include <algorithm>

using namespace std;

int main(int argc, char **argv)
{
	(void)argc; (void)argv; //unused

	std::string part1;
	cout << "Part1: Enter flag:" << endl;
	cin >> part1;

	int64_t part2;
	cout << "Part2: Input 51337:" << endl;
	cin >> part2;

	std::string part3;
	cout << "Part3: Watch this: https://www.youtube.com/watch?v=PBwAxmrE194" << endl;
	cin >> part3;

	std::string part4;
	cout << "Part4: C.R.E.A.M. Get da _____: " << endl;
	cin >> part4;

	return 0;
}

Looking at line 16 (the Part2 prompt), we see the number 51337. Now, maybe I read too much into it, but it looks like 31337, which we all know is a fairly common number in CTFs. With no better reason, I decided to try replacing 51337 with 31337 and checking the blob hash. Works! Even though it’s overkill, I wrote a little script to do the fix:

#!python
def fix(filename):
    data = read_blob(filename)
    # Blob contents are bytes, so the patterns are bytes literals too.
    fixed = data.replace(b'51337', b'31337')
    write_blob('blob.fixed', fixed)
    print(git_sha1(fixed))

Running it, we get the hash 354ebf392533dce06174f9c8c093036c138935f3, and the file blob.fixed contains a new git blob, which we can place in the repository at .git/objects/35/4ebf392533dce06174f9c8c093036c138935f3. (At this point, I used git fsck -v to verify that we’re down to two corrupt blobs. Output omitted for brevity.)
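
The destination path above follows git’s loose-object layout: the first two hex digits of the ID name a directory, and the remaining 38 name the file inside it. A minimal sketch of that mapping, assuming a full 40-character SHA-1 and using a hypothetical helper name (loose_object_path is mine, not a git API):

```python
import os


def loose_object_path(git_dir, sha1_hex):
    """Return where git keeps the loose object with this ID.

    The first two hex digits name a directory, the other 38 name the file.
    """
    if len(sha1_hex) != 40:
        raise ValueError('expected a full 40-character SHA-1')
    return os.path.join(git_dir, 'objects', sha1_hex[:2], sha1_hex[2:])


print(loose_object_path('.git', '354ebf392533dce06174f9c8c093036c138935f3'))
```

Copying blob.fixed to that path replaces the corrupt object. The existing file may be read-only, so it might need to be removed first.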

Time to fix the next blob: d961f81. This needs the same 51337 -> 31337 fix, but there’s more corruption this time. This one isn’t so trivial, but we get a clue from the commit message for this blob:

There’s only two factors. Don’t let your calculator lie.

Looking at the blob, we see this section:

cout << "Part5: Input the two prime factors of the number 270031727027." << endl;

Turns out that 270031727027 has 4 factors, so I suspect this number has gone wrong. Let’s try mutating the bytes there to find one that corrects it. (I could try checking for two factors, but this is fast enough to not worry about it.)

#!python
import string

# ID of the blob we are trying to recover, as reported by git fsck.
target = 'd961f81a588fcfd5e57bbea7e17ddae8a5e61333'


def permute_number(n):
    # Yield every string that differs from n in at most one digit.
    for p in range(len(n)):
        for i in string.digits:
            val = n[:p] + i + n[p+1:]
            yield val


def fix(filename):
    data = read_blob(filename)
    rawlen = len(data)
    before, after = data.replace(b'51337', b'31337').split(b'270031727027')
    for permute in permute_number('270031727027'):
        data = before + permute.encode('ascii') + after
        assert rawlen == len(data)
        if git_sha1(data) == target:
            print('Found number: %s' % permute)
            write_blob('blob.fixed', data)
            break

This only takes a second to tell us that the number should be 272031727027 instead. The SHA1 matches, and we copy it to .git/objects/d9/61f81a588fcfd5e57bbea7e17ddae8a5e61333. git fsck -v again to check that git sees it correctly, and we’re off to the final blob. It turns out this one is very easy to fix by inspection, once we see this segment:

#!c++
std::string flag = calculate_flag(part1, part2, part4, factor1, factor2);
cout << "flag{";
cout << &lag;
cout << "}" << endl;

There is no variable named lag to take the address of. Since the length is right, maybe we should just try changing &lag to flag. Incorporating our previous fixes, we get this fix script:

#!python
def fix(filename):
    data = read_blob(filename)
    # Apply all three repairs found so far, as bytes replacements.
    fixed = data.replace(b'51337', b'31337')
    fixed = fixed.replace(b'270031727027', b'272031727027')
    fixed = fixed.replace(b'&lag', b'flag')
    print(git_sha1(fixed))
    write_blob('blob.fixed', fixed)

Again, we copy to the objects directory. Now that git fsck -v reports a good repository, we reset to HEAD to get the right version of the C++ source (though we could have just taken the fixed variable from our script above, to be honest) and build it, then run it:

:::text
Part1: Enter flag:
flag
Part2: Input 31337:
31337
Part3: Watch this: https://www.youtube.com/watch?v=PBwAxmrE194
foo
Part4: C.R.E.A.M. Get da _____: 
money
Part5: Input the two prime factors of the number 272031727027.
31357
8675311
flag{3b532e0a187006879d262141e16fa5f05f2e6752}

Bam! 400 points in the bank.
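
As a cross-check on the Part5 numbers, here’s a small trial-division sketch (my own helper, not part of the challenge) that factors the repaired value. It should print exactly the two primes entered in the run above. The same function can be pointed at the corrupted 270031727027 to see that it does not split into just two primes.

```python
def prime_factors(n):
    """Return the prime factors of n (with repetition) by trial division."""
    factors = []
    d = 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        # Whatever is left after dividing out small factors is itself prime.
        factors.append(n)
    return factors


# The repaired number from the challenge.
print(prime_factors(272031727027))
# The corrupted number, for comparison.
print(prime_factors(270031727027))
```

Trial division only has to run up to the square root of the number, which is around half a million here, so this finishes almost instantly.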


What the LastPass CLI tells us about LastPass Design

LastPass is a password manager that claims not to be able to access your data.

All sensitive data is encrypted and decrypted locally before syncing with LastPass. Your key never leaves your device, and is never shared with LastPass. Your data stays accessible only to you.

While it would be pretty hard to prove that claim, it is interesting to take a look at how they implement their zero-knowledge encryption. The LastPass browser extensions are a mess of minified JavaScript, but they’ve been kind enough to publish an open-source command line client that’s quite readable C code. I was interested to see what we could learn from the CLI, and while it won’t prove that they can’t read your passwords, it will help to understand their design.

All of my observations are from their git repo as of commit d96053af621f5e4b784aab3194530216b8d2ef9d. I’ll try to include code snippets as well to provide context in addition to line number references.

Deriving Your Encryption Key

Let’s start by looking at how your encryption key is determined. Looking at kdf.c, we see the following function:

void kdf_decryption_key(const char *username, const char *password, int iterations, unsigned char hash[KDF_HASH_LEN])
{
  _cleanup_free_ char *user_lower = xstrlower(username);

  if (iterations < 1)
    iterations = 1;

  if (iterations == 1)
    sha256_hash(user_lower, strlen(user_lower), password, strlen(password), hash);
  else
    pdkdf2_hash(user_lower, strlen(user_lower), password, strlen(password), iterations, hash);
  mlock(hash, KDF_HASH_LEN);
}

A couple of things worth noting: pdkdf2_hash is a function that uses different underlying functions on different platforms (OS X vs Linux), but just performs a basic PBKDF2 operation. It takes, in this order: salt, salt length, password, password length, number of iterations, and output buffer. It uses HMAC-SHA256 as the underlying crypto primitive. (And the misspelling of pbkdf2 as pdkdf2 is theirs, not mine.)

Also worth noting is the special case when iterations equals 1. Entirely as speculation on my part, but I suspect this indicates that they formerly did a plain SHA-256 (well, SHA-256 of the username and password concatenated) for the encryption key. This is genuinely speculative, but why else special case 1 iteration? 1 iteration of PBKDF2 is valid, though incredibly weak, so there would be no need for the 1 round case.

Other than the special case, this looks to me like a perfectly normal PBKDF2 implementation to get a strong encryption key from the password.
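
To make the control flow concrete, here’s a rough Python equivalent of what I read kdf_decryption_key as doing. It is a sketch under assumptions, not LastPass’s code: I’m assuming KDF_HASH_LEN is 32 bytes (one SHA-256 digest), that sha256_hash hashes the username followed by the password, and that lowercasing behaves like Python’s str.lower() for the usernames in question. The credentials and iteration count are made up.

```python
import hashlib

KDF_HASH_LEN = 32  # assumed: the size of a SHA-256 digest


def decryption_key(username, password, iterations):
    """Mirror the control flow of kdf_decryption_key as described above."""
    salt = username.lower().encode('utf-8')
    secret = password.encode('utf-8')
    if iterations <= 1:
        # Legacy path: a single SHA-256 over username followed by password.
        return hashlib.sha256(salt + secret).digest()
    # Normal path: PBKDF2-HMAC-SHA256 with the lowercased username as salt.
    return hashlib.pbkdf2_hmac('sha256', secret, salt, iterations,
                               KDF_HASH_LEN)


# Illustrative values only.
key = decryption_key('User@Example.com', 'correct horse', 5000)
print(key.hex())
```

Note that the username is the only salt, so two accounts with the same username, password, and iteration count would derive the same key.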

Deriving Your Login Hash

So, if the encryption key is generated that way, how do they authenticate users? Using the same hash would be problematic, as LastPass would then get the encryption key. Likewise, sending anything with fewer rounds would just allow someone to apply the remaining rounds and derive the encryption key, so we need something else. Let’s take a look (conveniently also in kdf.c):

void kdf_login_key(const char *username, const char *password, int iterations, char hex[KDF_HEX_LEN])
{
  unsigned char hash[KDF_HASH_LEN];
  size_t password_len;
  _cleanup_free_ char *user_lower = xstrlower(username);

  password_len = strlen(password);

  if (iterations < 1)
    iterations = 1;

  if (iterations == 1) {
    sha256_hash(user_lower, strlen(user_lower), password, password_len, hash);
    bytes_to_hex(hash, &hex, KDF_HASH_LEN);
    sha256_hash(hex, KDF_HEX_LEN - 1, password, password_len, hash);
  } else {
    pdkdf2_hash(user_lower, strlen(user_lower), password, password_len, iterations, hash);
    pdkdf2_hash(password, password_len, (char *)hash, KDF_HASH_LEN, 1, hash);
  }

  bytes_to_hex(hash, &hex, KDF_HASH_LEN);
  mlock(hex, KDF_HEX_LEN);
}

A little bit longer than the encryption key function, but pretty straightforward nonetheless. Assuming you have more than one iteration (as any new user would), you get the same hash as generated for the encryption key, and then use the password as a salt and do 1 PBKDF2 round on the encryption key result. Because PBKDF2 keys its HMAC with the “password” argument, this is essentially an HMAC-SHA256 keyed with the encryption key over the master password (plus a 4-byte block counter), which means converting the login hash to the encryption key is as difficult as finding a 1st preimage on SHA256. Seems unlikely.

It’s easy to see that there’s still special-casing for one iteration. In that case, you get (essentially) sha256(sha256(username + password) + password). It’s still computationally infeasible to invert, but an attacker with the hash & associated username can trivially apply a dictionary attack to discover the original password (and hence, the encryption key). It’s a good thing they’ve moved on to PBKDF2. :)
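
Here’s the same logic sketched in Python, under the same assumptions as before (32-byte output, UTF-8 encoding, lowercase hex from bytes_to_hex, made-up credentials). The second function spells out the final step as a single HMAC call: in PBKDF2’s construction the “password” argument (here, the encryption key) is the HMAC key, and the salt (here, the master password) followed by a big-endian block counter of 1 is the message.

```python
import hashlib
import hmac


def login_key(username, password, iterations):
    """Mirror kdf_login_key as described above; returns a hex string."""
    salt = username.lower().encode('utf-8')
    secret = password.encode('utf-8')
    if iterations <= 1:
        # Legacy path: sha256(hex(sha256(username + password)) + password).
        inner = hashlib.sha256(salt + secret).hexdigest().encode('ascii')
        return hashlib.sha256(inner + secret).hexdigest()
    key = hashlib.pbkdf2_hmac('sha256', secret, salt, iterations, 32)
    # One extra PBKDF2 round, with the master password as the salt.
    return hashlib.pbkdf2_hmac('sha256', key, secret, 1, 32).hex()


def login_key_via_hmac(username, password, iterations):
    """Multi-iteration branch only, with the last step as one HMAC call."""
    salt = username.lower().encode('utf-8')
    secret = password.encode('utf-8')
    key = hashlib.pbkdf2_hmac('sha256', secret, salt, iterations, 32)
    # A single PBKDF2 iteration of one block is HMAC(key, salt || 0x00000001).
    message = secret + b'\x00\x00\x00\x01'
    return hmac.new(key, message, hashlib.sha256).hexdigest()


# Illustrative values only; both lines should print the same hex string.
print(login_key('user@example.com', 'correct horse', 5000))
print(login_key_via_hmac('user@example.com', 'correct horse', 5000))
```

Going from the login hash back to the encryption key would mean recovering an HMAC key from its output, which is why the server seeing this value doesn’t hand it the key.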

How do they encrypt?

So, how do they handle encryption and decryption? Well, it turns out that’s interesting too. Looking at cipher.c, there’s a lot of code for RSA crypto, but that’s only used if you’re sharing passwords with another user. What does get interesting is when you look at their decryption method:

char *cipher_aes_decrypt(const unsigned char *ciphertext, size_t len, const unsigned char key[KDF_HASH_LEN])
{
  EVP_CIPHER_CTX ctx;
  char *plaintext;
  int out_len;

  if (!len)
    return NULL;

  EVP_CIPHER_CTX_init(&ctx);
  plaintext = xcalloc(len + AES_BLOCK_SIZE + 1, 1);
  if (len >= 33 && len % 16 == 1 && ciphertext[0] == '!') {
    if (!EVP_DecryptInit_ex(&ctx, EVP_aes_256_cbc(), NULL, key, (unsigned char *)(ciphertext + 1)))
      goto error;
    ciphertext += 17;
    len -= 17;
  } else {
    if (!EVP_DecryptInit_ex(&ctx, EVP_aes_256_ecb(), NULL, key, NULL))
      goto error;
  }
  if (!EVP_DecryptUpdate(&ctx, (unsigned char *)plaintext, &out_len, (unsigned char *)ciphertext, len))
    goto error;
  len = out_len;
  if (!EVP_DecryptFinal_ex(&ctx, (unsigned char *)(plaintext + out_len), &out_len))
    goto error;
  len += out_len;
  plaintext[len] = '\0';
  EVP_CIPHER_CTX_cleanup(&ctx);
  return plaintext;

error:
  EVP_CIPHER_CTX_cleanup(&ctx);
  secure_clear(plaintext, len + AES_BLOCK_SIZE + 1);
  free(plaintext);
  return NULL;
}

What’s the significant part here? If your eyes jump to the strange conditional, you’ve found the same thing I did. What’s the difference in the resulting OpenSSL calls? It’s subtle, but it’s EVP_aes_256_cbc() versus EVP_aes_256_ecb(). If the ciphertext begins with the character !, the next 16 bytes are used as an IV, and the mode is set to CBC. If it doesn’t begin with that, then ECB mode is used. This is interesting because it suggests that LastPass formerly used ECB mode for their encryption. If you don’t know why this is bad, I strongly suggest the Wikipedia article on block cipher modes of operation. Hopefully this has long been addressed and the code only remains to handle a few edge cases for people who haven’t logged in to their account in a very long time. (Again, this is all speculation.)

For what it’s worth, just a few lines further down, you’ll find the function cipher_aes_encrypt that shows all the encryption operations, at least from this client, are done in CBC mode with a random IV.

If you’re wondering why the comparison looks so strange, consider this: if they just checked the first character of the ciphertext, then roughly 1 in 256 ECB-mode ciphertexts would match that. Since ECB-mode ciphertexts are multiples of the block length (as are CBC ciphertexts before the ! and IV are prepended), checking that the length has one extra character (len % 16 == 1) rules out these false matches.
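
The format check can be sketched on its own, without any actual crypto. This is my reading of the conditional in cipher_aes_decrypt translated to Python; the function name split_ciphertext and the tuple it returns are my own invention for illustration.

```python
AES_BLOCK_SIZE = 16


def split_ciphertext(ciphertext):
    """Return (mode, iv, body) using the same test as cipher_aes_decrypt."""
    looks_like_cbc = (
        len(ciphertext) >= 33                        # '!' + IV + one block
        and len(ciphertext) % AES_BLOCK_SIZE == 1    # one byte past a block
        and ciphertext[:1] == b'!'
    )
    if looks_like_cbc:
        iv = ciphertext[1:17]
        return 'cbc', iv, ciphertext[17:]
    # Anything else is treated as a bare ECB ciphertext with no IV.
    return 'ecb', None, ciphertext


# '!' + 16-byte IV + one 16-byte block is the smallest CBC-format value.
print(split_ciphertext(b'!' + bytes(16) + bytes(16))[0])
# An ECB ciphertext that happens to start with '!' is still seen as ECB.
print(split_ciphertext(b'!' + bytes(31))[0])
```

The second case is exactly the false match the length check is there to prevent: the first byte is !, but the total length is a whole number of blocks.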

Transport Security

This section, in particular, is only relevant to this command line client, as the browser extensions all use the browser’s built-in communications mechanisms. http.c shows us how the LastPass client communicates with their servers. It really attempts to emulate a fairly standard client as much as possible – sending the PHPSESSID as a cookie, using HTTP POST for everything. One very interesting note is this line:

curl_easy_setopt(curl, CURLOPT_SSL_CTX_FUNCTION, pin_certificate);

They pin the Thawte CA certificate for their communication to help reduce the risk of a man-in-the-middle attack.

Blobs, Chunks, and Fields

I’ve only had a quick look at blob.c, which contains their file format parsing code, but I think I have a rough idea of how it goes. Your entire LastPass database is a blob, which consists of chunks. Chunks can be of many types, one of which is an account chunk, which contains many fields.

Interestingly, if you look at read_crypt_string, it makes it obvious that, rather than encrypting your entire LP database or encrypting each account entry, fields are individually encrypted. Looking at account_parse, you can see that a lot of fields seem to be unused by the CLI client, but it’s interesting to see all the fields supported by LastPass. One of the most interesting findings is, in fact, right here:

entry_hex(url);

It can be confirmed by using a proxy to examine the traffic, but it turns out that the URLs of sites in your LastPass account database are stored only as hex-encoded ASCII strings. No encryption whatsoever. So LastPass can easily determine all of the sites that a user has accounts on. (This is genuinely surprising to me, but I triple-checked that this is actually the case.)
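
To show how little protection hex encoding offers, here’s a tiny sketch that reverses it. The sample value is one I made up by hex-encoding an example URL; it is not taken from a real LastPass blob, and decode_hex_url is a hypothetical helper name.

```python
import binascii


def decode_hex_url(field):
    """Turn a hex-encoded ASCII field back into the URL it represents."""
    return binascii.unhexlify(field).decode('ascii')


# Made-up sample: hex-encode a URL, then recover it with no key at all.
sample = binascii.hexlify(b'https://example.com/login')
print(sample)
print(decode_hex_url(sample))
```

Anyone holding the stored field can do this, since no secret is involved.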

Future Work

I think it would be interesting to dump the entire blob in a readable format. There are some interesting things in there, like equivalencies between multiple domains. (If an attacker could append one of those, they could get credentials for a legitimate domain sent to a domain they control.) I’d also like to poke at the extensions a little bit more, but reversing compiled JavaScript isn’t the most fun thing ever. :) (Suggestions of tools in this space would be welcome.)

One thing is important to understand: no evaluation can say for sure that LastPass can’t recover your passwords. Even if they’re doing everything right today, they could push a new version tomorrow (extensions are generally automatically updated) that records your master password. It’s inherent in the model of any browser extension-based password manager.


So, is Windows 10 Spying On You?

“Extraordinary claims require extraordinary evidence.”

A few days ago, localghost.org posted a translation of a Czech article alleging Windows 10 “phones home” in a number of ways. I was a little surprised, and more than a little alarmed, by some of the claims. Rather than blindly repost the claims, I decided it would be a good idea to see what I could test for myself. Rob Seder has done something similar, but I’m taking it a step further to look at the real traffic contents.

Tools & Setup

I’m running the Windows 10 Insider Preview (which, admittedly, may not be the same as the release, but it’s what I had access to) in VirtualBox. The NIC on my Windows 10 VM is connected to an internal network to a Debian VM with my tools installed, which is in turn connected out to the internet. On the Debian VM, I’m using mitmproxy to perform a traffic MITM. I’ve also used VirtualBox’s network tracing to collect additional data.

Currently, I have all privacy settings set to the default, but I am not signed into a Microsoft Live account. This is an attempt to replicate the findings from the original article. At the moment, I’m only looking at HTTP/HTTPS traffic in detail, even though the original article wasn’t even specific enough to indicate what protocols were being used.

Claim 1. All text typed on the keyboard is sent to Microsoft

When typing into the search bar within the Start menu, an HTTPS request is sent after each character entered. Presumably this is to give web results along with local results, but the amount of additional metadata included is just mind-boggling. Here’s what the request for such a search looks like (some headers modified):

GET /AS/API/WindowsCortanaPane/V2/Suggestions?qry=about&cp=5&cvid=ce8c2c3ad6704645bb207c0401d709aa&ig=7fdd08f6d6474ead86e3c71404e36dd6&cc=US&setlang=en-US HTTP/1.1
Accept:                        */*
X-BM-ClientFeatures:           FontV4, OemEnabled
X-Search-SafeSearch:           Moderate
X-Device-MachineId:            {73737373-9999-4444-9999-A8A8A8A8A8A8}
X-BM-Market:                   US
X-BM-DateFormat:               M/d/yyyy
X-Device-OSSKU:                48
X-Device-NetworkType:          ethernet
X-BM-DTZ:                      -420
X-BM-UserDisplayName:          Tester
X-DeviceID:                    0100D33317836214
X-BM-DeviceScale:              100
X-Device-Manufacturer:         innotek GmbH
X-BM-Theme:                    ffffff;005a9e
X-BM-DeviceDimensionsLogical:  320x622
X-BM-DeviceDimensions:         320x622
X-Device-Product:              VirtualBox
X-BM-CBT:                      1439740000
X-Device-isOptin:              false
X-Device-Touch:                false
X-AIS-AuthToken:               AISToken ApplicationId=25555555-ffff-4444-cccc-a7a7a7a7a7a7&ExpiresOn=1440301800&HMACSHA256=CS
                               y7XaNyyCE8oAZPeN%2b6IJ4ZrpqDDRZUIJyKvrIKnTA%3d
X-Device-ClientSession:        95290000000000000000000000000000
X-Search-AppId:                Microsoft.Windows.Cortana_cw5n1h2txyewy!CortanaUI
X-MSEdge-ExternalExpType:      JointCoord
X-MSEdge-ExternalExp:          sup001,pleasenosrm40ct,d-thshld42,d-thshld77,d-thshld78
Referer:                       https://www.bing.com/
Accept-Language:               en-US
Accept-Encoding:               gzip, deflate
User-Agent:                    Mozilla/5.0 (Windows NT 10.0; Win64; x64; Trident/7.0; rv:11.0; Cortana 1.4.8.152;
                               10.0.0.0.10240.21) like Gecko
Host:                          www.bing.com
Connection:                    Keep-Alive
Cookie:                        SA_SUPERFRESH_SUPPRESS=SUPPRESS=0&LAST=1439745358300; SRCHD=AF=NOFORM; ...

In addition to my query, “about”, it sends a “DeviceID”, a “MachineId”, the username I’m logged in as, the platform (VirtualBox), and a number of opaque identifiers in the query, the X-AIS-AuthToken, and the Cookies. That’s a lot of information just to give you search results.
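
For anyone repeating this, a few lines of Python are enough to pull the interesting bits out of a captured request. This is only a sketch: the request line and headers below are an abbreviated copy of the capture above, and the list of header prefixes I treat as “identifying” is my own choice, not anything official.

```python
from urllib.parse import urlsplit, parse_qs

# Abbreviated copy of the captured request (some parameters dropped).
REQUEST_LINE = ('GET /AS/API/WindowsCortanaPane/V2/Suggestions'
                '?qry=about&cp=5&cc=US&setlang=en-US HTTP/1.1')
HEADERS = """\
X-Device-MachineId: {73737373-9999-4444-9999-A8A8A8A8A8A8}
X-BM-UserDisplayName: Tester
X-DeviceID: 0100D33317836214
X-Device-Product: VirtualBox
Accept-Language: en-US
"""


def query_params(request_line):
    """Return the query string parameters of an HTTP request line."""
    _method, target, _version = request_line.split(' ')
    return {k: v[0] for k, v in parse_qs(urlsplit(target).query).items()}


def identifying_headers(raw_headers, prefixes=('X-Device', 'X-BM-User')):
    """Pick out headers whose names start with one of the given prefixes."""
    found = {}
    for line in raw_headers.splitlines():
        name, _, value = line.partition(':')
        if name.startswith(prefixes):
            found[name] = value.strip()
    return found


print(query_params(REQUEST_LINE))
print(identifying_headers(HEADERS))
```

The qry parameter is the text typed so far and cp looks like the cursor position, which lines up with one request being sent per keystroke.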

Claim 2. Telemetry including file metadata is sent to Microsoft

I searched for several movie titles, including “Mission Impossible”, “Hackers”, and “Inside Out.” Other than the Cortana suggestions above, I didn’t see any traffic pertaining to these searches. Certainly, I didn’t see any evidence of uploading a list of multimedia files from my Windows 10 system, as described in the original post.

I also searched for a phone number in the Edge browser, as described in the original post. (Specifically, I searched for 867-5309.) The only related traffic I saw was traffic to the site on which I performed the search (yellowpages.com). No traffic containing that phone number went to any Microsoft-run server, as far as I can tell.

Claim 3. When a webcam is connected, 35MB of data gets sent

Nope. Not even close. I shut down the VM, started a new PCAP, restarted, and attached a webcam via USB forwarding in VirtualBox. After the drivers were fully installed, I shut down the system. The pcap was under 800k in total, a far cry from the claimed 35MB. Looking at mitmproxy and the pcap, the largest single connection was ~82kB in size. I have no idea what traffic he saw, but I saw no alarming connection related to plugging in a webcam. My best guess is maybe it’s actually 35MB of download, and his webcam required a driver download. (Admittedly a large driver, but I’ve seen bigger.)

Traffic from Connecting a Webcam
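
If you want to put a number on a capture like this yourself, the classic pcap format is simple enough to total up with just the standard library. This is a minimal sketch that assumes a classic (non-pcapng) file with microsecond timestamps. The function name and the capture filename in the usage note are hypothetical.

```python
import struct


def pcap_total_bytes(path):
    """Sum the on-the-wire length of every packet in a classic pcap file."""
    with open(path, 'rb') as fp:
        global_header = fp.read(24)
        magic = global_header[:4]
        if magic == b'\xd4\xc3\xb2\xa1':
            endian = '<'   # written on a little-endian machine
        elif magic == b'\xa1\xb2\xc3\xd4':
            endian = '>'   # written on a big-endian machine
        else:
            raise ValueError('not a classic pcap file')
        total = 0
        while True:
            record = fp.read(16)
            if len(record) < 16:
                break
            # Record header: ts_sec, ts_usec, captured length, original length.
            _sec, _usec, incl_len, orig_len = struct.unpack(
                endian + 'IIII', record)
            total += orig_len
            fp.seek(incl_len, 1)  # skip over the captured packet bytes
    return total
```

Calling pcap_total_bytes('webcam.pcap') on a trace gives the total packet bytes seen on the wire, which is the figure to compare against a claim like 35MB.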

Claim 4. Everything said into a microphone is sent

Even when attempting to use the speech recognition in Windows, I saw no transfer large enough to be spoken audio. Additionally, no intercepted HTTP or HTTPS traffic contained the raw words that I spoke to the voice recognition service. Maybe if signed in to Windows Live, Cortana performs uploads, but without being signed in, I saw nothing representative of the words I used with speech recognition.

Claim 5. Large volumes of data are uploaded when Windows is left unattended

I left Windows running for >1 hour while I went and had lunch. There were a small number of HTTP(S) requests, but they all seemed to be related to either updating the weather information displayed in the tiles or checking for new Windows updates. I don’t know what the OP considers “large volumes”, but I’m not seeing it.

Conclusion

The original post made some extraordinary claims, and I’m not seeing anything to the degree they claimed. To be sure, Windows 10 shares more data with Microsoft than I’d be comfortable with, particularly if Cortana is enabled, but it doesn’t seem to be anything like the levels described in the article. I wish the original poster had posted more about the type of traffic he was seeing, the specific requests, or even his methodology for testing.

The only dubious behavior I observed was sending every keystroke in the Windows Start Menu to the servers, but I understand that combined Computer/Web search is being sold as a feature, and this is necessary for that feature. I don’t know why all the metadata is needed, and it’s possibly excessive, but this isn’t the keylogger the original post claimed.

Unfortunately, it’s impossible to disprove his claims, but if it’s as bad as suggested, reproducing it should’ve been possible, and I’ve been unable to reproduce it. I encourage others to try it as well – if enough of us do it, it should be possible to either confirm or strongly refute the original claims.